Online Networking Architectures | The New Stack https://thenewstack.io/networking/

Red Hat Launches OpenStack Platform 17.1 with Enhanced Security https://thenewstack.io/red-hat-launches-openstack-platform-17-1-with-enhanced-security/ Wed, 14 Jun 2023 17:34:12 +0000 https://thenewstack.io/?p=22711054

VANCOUVER — At the OpenInfra Summit here, Red Hat announced the impending release of its OpenStack Platform 17.1. This release is the product of the company's ongoing commitment to supporting telecoms as they build their next-generation 5G network infrastructures.

In addition to bridging existing 4G technologies with emerging 5G networks, the platform enables advanced use cases like 5G standalone (SA) core, open virtualized radio access networks (RAN), and network, storage, and compute functionalities, all with increased resilience. And, when it comes to telecoms, the name of the game is resilience. Without it, your phone won’t work, and that can’t happen.

Runs On OpenShift

The newest version of the OpenStack Platform runs on Red Hat OpenShift, the company's Kubernetes distribution, with Red Hat Enterprise Linux (RHEL) 8.4 or 9.2 underneath. This means it can support logical volume management (LVM) partitioning and Domain Name System as a Service (DNSaaS).

The volume management partitioning enables short-lived snapshot and revert functionality, letting service providers roll back to a previous state during upgrades if something goes wrong. Of course, we all know that everything goes smoothly during updates and upgrades. Not.

This take on DNSaaS includes a framework for integration with Compute (Nova) and OpenStack Networking (Neutron) notifications, allowing auto-generated DNS records. In addition, DNSaaS includes integration support for Bind9.

Other Improvements

Red Hat also announced improvements to the Open Virtual Network (OVN) capabilities, Octavia load balancer, and virtual data path acceleration. These enhancements ensure higher network service quality and improved OVN migration time for large-scale deployments.

OpenStack Platform 17.1 continues its legacy of providing a secure and flexible private cloud built on open source foundations. This latest release offers role-based access control (RBAC), FIPS-140 (ISO/IEC 19790) compatibility, federation through OpenID Connect, and Fernet tokens, ensuring a safer, more controlled IT environment.

Looking ahead to the next version, Red Hat software engineers are working on making it much easier to upgrade its OpenStack distro from one version to the next. Historically, this has always been a major headache for all versions of OpenStack. Red Hat’s control plane-based approach, a year or so in the future, sounds very promising.

WithSecure Pours Energy into Making Software More Efficient https://thenewstack.io/withsecure-pours-energy-into-making-software-more-efficient/ Thu, 01 Jun 2023 14:14:07 +0000 https://thenewstack.io/?p=22709522

WithSecure has unveiled a mission to reduce software energy consumption, backing research on how users trade off energy consumption against performance and developing a test bench for measuring energy use, which it ultimately plans to make open source.

The Finnish cyber security firm has also kicked off discussions on establishing standards for measuring software power consumption with government agencies in Finland and across Europe, after establishing that there is little in the way of guidance currently.

Power Consumption

Power consumption by backend infrastructure is a known problem. Data centers, for example, account for up to 1.3% of worldwide electricity consumption, according to the International Energy Agency. While this figure has stayed relatively stable in recent years, it excludes the impact of crypto mining, which accounts for almost half as much.

A report for the UK Parliament last year cited estimates that user devices consume more energy than networks and data centers combined.

Leszek Tasiemski, WithSecure's vice president for product management, spoke at Sphere 2023 in Helsinki, saying that most of the firm's own operations run in the cloud, which gives it good visibility into the resources it is using and their CO2 impact.

Most of the data centers it uses already run on renewable energy sources, he said, and it was already “optimizing the code as much as we can so that it performs less operations. Or it performs the same operations with a different approach or a different programming language or different libraries so that it results in less CPU cycles, less I/O operations.”

It is harder to have an impact on power consumption outside of the platforms it directly controls, says Tasiemski, but the firm is working to optimize the agent software its clients run on their systems.

The energy consumption of the WithSecure agent, which runs on clients’ devices, might be relatively small, but Tasiemski said, “This is where we have economies of scale. We have millions of devices out there.”

This would benefit the users, he said. “This is not for our direct benefit. It’s not our electricity bills, it’s not our heat to remove.” He added that, as for its own systems, lowering energy usage usually means better performance. “It’s not always black and white, but it’s related.”

The Challenge

The challenge is how to do this without compromising security. Users can vary settings in WithSecure’s profile editor, for example, how often scans are run. Optimizing or adjusting these could be used to reduce resource use. But this could also be dangerous if admins are so focused on energy reduction that they dial things back too far.

So it has kicked off research at Poland’s Poznan University of Technology to examine how general users and security pros are likely to visualize energy consumption versus risk appetite. “We are doing this research to see how we can, in a responsible way, show this information,” said Tasiemski.

Tasiemski said another problem is that there aren’t many standards for measuring energy consumption by software, so WithSecure intends to meet with government organizations and institutions to try to kickstart a conversation. There is no tangible work at present, either in Finland or the European Commission. He said there seems to be some work going on in France, so he is trying to contact the relevant organizations there.

“In the case of software, it’s incredibly hard to figure out common standards for energy efficiency. We have it for buildings. Buildings are also not the easiest thing to measure, so I think it can be done.”

He said there was no direct commercial objective to this. "I absolutely don't mind if somebody steals our idea. We do the research; it will be open to everybody. So if other companies would like to use it, yeah, go ahead."

Likewise, he said, WithSecure has built a test bench for measuring energy usage of software. It has been using this since January to measure the power consumption of its ongoing agent releases by modeling typical user behavior. The goal is to establish a baseline against which it can measure progress in reducing consumption over time.
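WithSecure has not published the test bench itself, but on Linux one common way to approximate this kind of measurement is to sample the Intel RAPL energy counters exposed under /sys/class/powercap before and after a representative workload. The following is a minimal sketch of that approach, not WithSecure's implementation; it assumes an Intel CPU, Linux with RAPL support and permission to read the powercap files, and the workload function is just a stand-in for whatever behavior is being modeled.

# Minimal sketch: sample the package-level RAPL energy counter around a workload.
# Assumes Linux with Intel RAPL exposed at /sys/class/powercap/intel-rapl:0
# (reading may require elevated privileges). Not WithSecure's actual test bench.
import time

RAPL_FILE = "/sys/class/powercap/intel-rapl:0/energy_uj"  # package-0 energy, in microjoules

def read_energy_uj():
    with open(RAPL_FILE) as f:
        return int(f.read().strip())

def measure_joules(workload):
    start_energy = read_energy_uj()
    start_time = time.monotonic()
    workload()
    elapsed = time.monotonic() - start_time
    # The counter wraps around periodically; a real harness would handle that.
    used_uj = read_energy_uj() - start_energy
    return used_uj / 1_000_000, elapsed  # joules, seconds

def simulated_scan():
    # Stand-in for the "typical user behavior" that drives the agent under test.
    sum(i * i for i in range(10_000_000))

joules, seconds = measure_joules(simulated_scan)
print(f"~{joules:.2f} J over {seconds:.1f} s (about {joules / seconds:.2f} W average)")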

“I absolutely wouldn’t mind open sourcing that because this is not our core business, and it’s only for the greater good.” He said the biggest brake on making it open source so far is that it was still being tweaked and he wanted to be sure the documentation was good enough.

But ultimately, making such tools open source was the right thing to do, he said. “It doesn’t make sense if every company builds things like that on their own because it’s going to be built in a different way. Classical reinventing the wheel.” And that would be a waste of resources in itself.

Don’t Force Containers and Disrupt Workflows https://thenewstack.io/dont-force-containers-and-disrupt-workflows/ Thu, 25 May 2023 22:10:20 +0000 https://thenewstack.io/?p=22709074

How do you allow people to use their technologies in their workflows? The first thing you do is not force people to use containers, says Rob Barnes, a senior developer advocate at HashiCorp, in this episode of The New Stack Makers.

Barnes came by The New Stack booth at KubeCon Europe in Amsterdam to discuss how HashiCorp builds intent into Consul so users may use containers or virtual machines in their workflows.

Consul from HashiCorp is one of the early implementations of service mesh technology, writes Janakiram MSV in The New Stack. "It comes with a full-featured control plane with service discovery, configuration, and segmentation functionality. The best thing about Consul is the support for various environments including traditional applications, VMs, containers, and orchestration engines such as Nomad and Kubernetes."

Consul is, at heart, a networking service that provides identity, for example, in Kubernetes. A service mesh knows about all services across the stack. In Kubernetes, Helm charts get configured to register the services to Consul automatically. That’s a form of intent. Trust is critical to that intent in Kubernetes.

“We can then assign identity — so in a kind of unofficial way, Consul has almost become an identity provider for services,” Barnes said.

In Consul, identity helps provide more granular routing to services, Barnes said. Consul can dictate which services are allowed to talk to each other, and the intent gets established through a rules-based system that permits some service-to-service calls and denies others.
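Barnes didn't walk through configuration details in the conversation, but to make the idea concrete, here is a hedged sketch of such a rule expressed as a service-intentions config entry and applied through Consul's HTTP API. It assumes a local Consul agent on localhost:8500 with Connect enabled and a reasonably recent Consul release; the service names web and billing are hypothetical.

# Hedged sketch: allow "web" to call "billing" and deny everything else,
# written through Consul's config-entry HTTP API. Service names are made up.
import json
import urllib.request

intention = {
    "Kind": "service-intentions",
    "Name": "billing",  # destination service
    "Sources": [
        {"Name": "web", "Action": "allow"},
        {"Name": "*", "Action": "deny"},
    ],
}

req = urllib.request.Request(
    "http://localhost:8500/v1/config",
    data=json.dumps(intention).encode(),
    method="PUT",
)
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.read().decode())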

“I think that’s an opportunity that HashiCorp has taken advantage of,” Barnes said. “We can do a lot more here to make people’s lives easier and more secure.”

So what’s the evolution of service mesh?

"There's a lot of misconceptions with service mesh," Barnes said. "As I say, I think people feel that if you're using service meshes, that means you're using containers, right? Whereas, like, I can speak for Consul specifically, that's not the case. Right? I think the idea is that as more service meshes come out, they make themselves a bit more flexible and meet people where they are. I think the adoption of the service mesh, and all the good stuff that comes with it, is only going to grow."

“So I think what’s next for service mesh isn’t necessarily the service mesh itself. I think it’s people understanding how it fits into the bigger picture. And I think it’s an educational piece and where there are gaps, maybe we as vendors need to make some advances.”

How to Decide Between a Layer 2 or Layer 3 Network https://thenewstack.io/how-to-decide-between-a-layer-2-or-layer-3-network/ Tue, 25 Apr 2023 17:00:10 +0000 https://thenewstack.io/?p=22704440

As communication service providers (CSPs) continue to provide essential services to businesses and individuals, the demand for faster and more reliable network connectivity continues to grow in both scale and complexity. To meet these demands, CSPs must offer a variety of connectivity services that provide high-quality network performance, reliability and scalability.

When it comes to offering network connectivity services, CSPs can provide connectivity at either Layer 2 (the data link layer) or Layer 3 (the network, or packet, layer) of the Open Systems Interconnection (OSI) model for network communication.

This article will explore some of the advantages and benefits of each type of connectivity, in order for CSPs to determine which one may be better suited for different types of environments or applications.

What Is Layer 2 Connectivity?

At a basic level, Layer 2 connectivity refers to the use of the data link layer of the OSI Model. It is often used to connect local area networks (LANs) or to provide point-to-point connectivity between two networks or even broadcast domains or devices.

Often, Layer 2 connectivity is referred to as Ethernet connectivity, as Ethernet is one of the most common Layer 2 protocols used today, and it comes with several advantages.

First off, Layer 2 connectivity generally provides low latency as it requires fewer network hops than Layer 3 connectivity. This makes it ideal for applications that require low latency, such as real-time voice, video or highly interactive applications.

Layer 2 connectivity is also relatively simple to configure and maintain when compared to Layer 3 connectivity. Its connectivity reduces the complexity of network configurations by eliminating the need for complex routing protocols and configurations. This makes it an attractive option for small- to medium-sized businesses that do not have dedicated IT resources.

In addition to offering low latency and simplicity, Layer 2 connectivity also provides high network performance as it can take advantage of the full bandwidth of the network.

What Is Layer 3 Connectivity?

On the other hand, Layer 3 connectivity refers to the use of the network layer of the OSI model for network communication. It is often used to provide wide area network (WAN) connectivity, to connect different LANs and to provide access to the internet. Layer 3 connectivity is often referred to as IP connectivity, as IP is the most common Layer 3 protocol used today.

As with Layer 2, Layer 3 connectivity comes with its own set of advantages.

To start, Layer 3 connectivity is highly scalable and can handle large networks with many devices. Likewise, its connectivity provides flexibility in terms of routing and network design, making it suitable for complex network architectures.

Unlike Layer 2, Layer 3 connectivity provides enhanced security features, including firewalls and virtual private networks (VPNs), which can protect the network from external threats.

Additionally, its connectivity can help reduce network congestion by providing more efficient routing of network traffic, versus the management of large broadcast domains.

Layer 2 vs. Layer 3 Connectivity: Which Is Better?

The decision to use Layer 2 or Layer 3 connectivity depends on the specific needs of the application(s) or network. However, there are some general guidelines to consider.

For local network connectivity, Layer 2 connectivity is generally more suitable. It provides low latency and high performance, making it ideal for real-time applications such as voice and video.

For wide-area network connectivity, on the other hand, Layer 3 connectivity is generally more suitable as it provides scalability, flexibility and enhanced security features, making it ideal for connecting different LANs and for accessing the internet.

For applications that require both local and wide area network connectivity, a combination of Layer 2 and Layer 3 connectivity might be necessary to achieve optimal network performance.

Both Layer 2 and Layer 3 connectivity have their own distinct advantages and benefits.

While Layer 2 connectivity is simple to configure, provides low latency and high performance and is ideal for local network connectivity, Layer 3 connectivity is highly scalable, flexible and provides enhanced security features, making it ideal for wide-area network connectivity.

By selecting their necessary network qualities, CSPs can determine the best network connectivity service for their application and environment.

Linkerd Service Mesh Update Addresses More Demanding User Base https://thenewstack.io/linkerd-service-mesh-update-addresses-more-demanding-user-base/ Tue, 11 Apr 2023 13:17:14 +0000 https://thenewstack.io/?p=22704836

Five years ago, when the hype around the service mesh was at its greatest, Buoyant CEO William Morgan fielded a lot of questions about the company's flagship Linkerd open source service mesh software. Many in the open source community were very curious about what it could do and what it could be used for.

These days, Morgan still gets questions, but now they are a lot more pointed, about how Linkerd would work in a specific situation. Users are less worried about how it works and more concerned about just getting the job done. So they are more direct about what they want, and what they want to pay for.

"In the very early days of the service mesh, a lot of open source enthusiasts who were excited about the technology wanted to get to the details, and wanted to do all the exciting stuff," Morgan explained. "Now the audience coming in just wants it to work. They don't want to get into the details, because they've got like a business to run."

In anticipation of this year’s KubeCon + CloudNativeCon EU, Buoyant has released an update to Linkerd. Version 2.13 includes new features such as dynamic request routing, circuit breaking, automated health monitoring, vulnerability alerts, proxy upgrade assistance, and FIPS-140 “compatibility.”

And on April 18, the day before the Amsterdam-based KubeCon EU 2023 kicks off in earnest, the first-ever Linkerd Day co-located conference will be held.

What Is a Service Mesh?

Categorically, service mesh software is a tool for adding reliability, security, and observability features to Kubernetes environments. Kubernetes is a platform for building platforms, so it is not meant for managing the other parts of a distributed system, such as networking, Morgan explained.

In the networking realm, the service mesh software handles all the additional networking needs beyond the simple TCP connectivity Kubernetes offers, such as retries, mitigating failing requests, sending traffic to other clusters, encryption and access management. The idea with the service mesh is to add a "sidecar" to each instance of the application, so developers don't have to mess with all these aspects, which they may not be familiar with.

There are multiple service mesh packages — Istio, Consul, Traefik Mesh and so on — but what defines Linkerd specifically is its ease of use, Morgan said.

“When people come to us because they recognize the value of a service mesh, they want to add it to their stack,” Morgan said. “But they want a simple version, they don’t want a complicated thing. They don’t want to have to have a team of four service mesh engineers on call.”

Buoyant likes to tout Linkerd as the Cloud Native Computing Foundation's "only graduated service mesh" (CNCF also provides a home for Istio, though that service mesh is still at an incubating level). The graduated status simply means that Linkerd is not some "fly-by-night open source thing that's just been around for six months. It's a recognition of the maturity of the project."

New Features of Linkerd 2.13

For Kubernetes users, the newly-added dynamic request routing provides fine-grained control over the routing of individual HTTP and gRPC requests.

To date, Linkerd has offered a fair amount of traffic shaping, such as the ability to send a certain percentage of traffic to a different node. Now the level of granularity is much finer, with the ability to parse traffic by, say, query parameter or a specific URL. Requests can be routed based on HTTP headers, gRPC methods, query parameters, or almost any other aspect of the request.

One immediate use case that comes to mind is sticky sessions, where all a user's transactions take place on a single node in order to get the full benefit of caching. User-based A/B testing, canary deploys, and dynamic staging environments are some of the other possible uses. And they can be set up either by the users themselves, or even by third-party software vendors who want to offer specialized services around testing, for instance.

Linkerd's dynamic request routing came about thanks to the Kubernetes Gateway API. Leveraging the Gateway API "reduces the amount of new configuration machinery introduced onto the cluster," Buoyant states in its press materials. Although the Gateway API standard, concerning network ingress, wasn't specifically designed to address service mesh "east-west" capabilities, many of the same resource types can also be used to shape east-west traffic, relieving the administrative burden of learning yet another configuration syntax, Morgan said admiringly of the standard.
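For a concrete sense of what such a route looks like, here is a rough sketch (not Linkerd's documented procedure) that creates an HTTPRoute sending requests carrying a particular header to a canary backend, using the official Kubernetes Python client. The namespace, service names and header below are hypothetical, and the group/version shown is the upstream Gateway API; Linkerd 2.13 ships its own HTTPRoute resource in the policy.linkerd.io group with a very similar shape, so check your release's documentation for the exact group and version to use.

# Rough illustration only: create an HTTPRoute that routes requests carrying
# the header "x-canary: true" to a canary Service. Names are hypothetical.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in-cluster

http_route = {
    "apiVersion": "gateway.networking.k8s.io/v1beta1",
    "kind": "HTTPRoute",
    "metadata": {"name": "checkout-canary", "namespace": "shop"},
    "spec": {
        "parentRefs": [{"kind": "Service", "name": "checkout", "port": 8080}],
        "rules": [
            {   # requests with the canary header go to the canary backend
                "matches": [{"headers": [{"name": "x-canary", "value": "true"}]}],
                "backendRefs": [{"name": "checkout-canary", "port": 8080}],
            },
            {   # everything else keeps going to the regular backend
                "backendRefs": [{"name": "checkout", "port": 8080}],
            },
        ],
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="gateway.networking.k8s.io",
    version="v1beta1",
    namespace="shop",
    plural="httproutes",
    body=http_route,
)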

(Morgan also pointed to a promising new initiative within the Kubernetes community, called GAMMA, which would further synthesize service mesh requirements into the Gateway API.)

Another new feature with Linkerd: Circuit breaking, where Kubernetes users can mark services as delicate, so that meshed clients will automatically reduce traffic should these services start throwing a lot of errors.

Security, Gratis

A version of the 2.13 release comes in “a FIPS-compatible form,” the company asserts.

Managed by the U.S. National Institute of Standards and Technology (NIST), the Federal Information Processing Standard (FIPS) 140, currently at revision 3, is a set of standards for deploying encryption modules, with requirements around interfaces, operating environments, security management and lifecycle assurance. It is a government requirement for any software that touches encrypted traffic. Many other industries, such as finance, also follow the government's lead in using FIPS-compliant products.

That said, Linkerd is not certified for use by the U.S. government. "Compatible" means Buoyant feels it could pass muster with a NIST-accredited lab, though the company has no immediate plans to certify the software.

And, finally, Buoyant itself is offering all Linkerd users basic health monitoring, vulnerability reporting, and upgrade assistance through its Buoyant Cloud SaaS automation platform. This feature is for all users, even of the open source version, and not just for paid subscribers.

“We realized a lot of Linkerd users out there are actually in a vulnerable position,” Morgan explained. “They aren’t subscribed to the security mailing lists. They’re not necessarily monitoring the health of their deployments. They’re avoiding upgrades because that sounds like a pain. So we’re trying to provide them with tools. Even if it’s pure, open source, they can at least keep their clusters secure, and healthy and up to date.”

Of course, those with the paid edition get a more in-depth set of features.

To upgrade Linkerd 2.13 or install it new, start here, or search it out on the Azure Marketplace.

Wireshark Celebrates 25th Anniversary with a New Foundation https://thenewstack.io/wireshark-celebrates-25th-anniversary-with-a-new-foundation/ Tue, 28 Mar 2023 12:00:46 +0000 https://thenewstack.io/?p=22702112

No doubt, countless engineers and hackers remember the first time they used Wireshark, or — if they're a bit older — Wireshark's predecessor, Ethereal. The experience of using Wireshark is a bit like what Robert Hooke must have felt in 1665 when using the newly developed microscope to view cells for the first time ever: What was once just an inscrutable packet had opened up to reveal a treasure trove of useful information.

This year, the venerable Wireshark has turned 25, and its creators are taking a step back from this massively successful open source project to let additional parties help govern it. This month, Sysdig, the current sponsor of Wireshark, launched a new foundation that will serve as the long-term custodian of the project. The Wireshark Foundation will house the Wireshark source code and assets, and manage SharkFest, Wireshark's developer and user conference (Singapore April 17-19 and San Diego June 10-15).

The creators call the software the “world’s foremost traffic protocol analyzer” with considerable justification. Just in the past five years, it has been downloaded more than 60 million times and has attracted more than 2,000 contributors. Today, Wireshark is free and available under the GNU General Public License (GPL) version 2.

Wireshark provides a glimpse into the traffic going across your network at a packet level, allowing users to understand the system better and diagnose problems. A powerful built-in data parsing engine is only half the appeal; an extensible design has allowed others to easily provide plug-ins for an endless array of new protocols and data formats.
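That dissection engine can also be driven from scripts rather than the GUI. As a small hedged example (it assumes Wireshark/tshark and the third-party pyshark package are installed, and uses a hypothetical capture file named demo.pcap), the following walks a capture and prints what the dissectors made of each packet.

# Small sketch using pyshark, a third-party Python wrapper around tshark, to
# reuse Wireshark's dissectors from a script. The capture file name is made up.
import pyshark

capture = pyshark.FileCapture("demo.pcap", display_filter="dns")  # any Wireshark display filter works
for packet in capture:
    # highest_layer is the top-most protocol the dissectors recognized (e.g. DNS)
    print(packet.number, packet.highest_layer, packet.length)
capture.close()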

There were packet analyzers prior to Ethereal, of course, but they were expensive.

When network engineer Gerald Combs first released this code as open source in 1998, he democratized IP packet inspection for everyone. And a few years later, when WiFi was being introduced, Ethereal was put into action by every system administrator trying to fix a buggy WiFi connection. It also inspired an entire generation of hackers — friendly or otherwise — to sniff out unsecured wireless connections ("wardriving").

“Wireshark is my favourite ‘I told you so’ tool. You can’t imagine how useful it is for network troubleshooting,” one Hacker News commenter enthused.

Network Observability for All

Combs created Ethereal while working as an engineer for a Kansas City Internet Service Provider, for the purposes of troubleshooting. At the time, the only packet sniffers available were costly, and the ISP didn’t have a budget for one (which could run into tens of thousands of dollars).

This was a few years into the commercial use of the Internet, and so when Combs released Ethereal, he immediately started getting contributions from others.

One of those early contributors was Loris Degioanni, now CTO and Founder of cloud security company Sysdig. He was in school at the time. His computer network professor had said that the best way to understand the network is to observe the network. But since there were no inexpensive packet sniffers for Windows, Degioanni wrote WinPcap, a driver for capturing packets in Windows machines, which many people immediately started using with Ethereal.  

One factor for Ethereal's success was its extensibility. It allowed many developers to work in parallel, creating plug-ins that would run on top of Ethereal's network analysis capabilities. In this way, it was "really easy for the project to accumulate features and functionality and become more and more useful at a very rapid pace," Degioanni said.

Contributions came in not just from students and hobbyists, but from engineers at actual companies, which found it more cost-effective to dedicate an engineer to creating and maintaining support for some obscure protocol that otherwise would require a more expensive tool to analyze.

The killer use, however, came from the emerging use of wireless (WiFi) networks. When it was introduced for home use in 1999, WiFi was still incredibly buggy. Degioanni worked with Combs to develop a plug-in for inspecting 802.11 wireless traffic on Windows XP, called AirPcap, which proved to be helpful for many who just wondered why their packets seemingly vanished in the air.

With wireless, Ethereal also attracted the attention of hackers, who could use network analysis to intercept wireless data packets from people and companies while sitting outside in a car with a laptop and a copy of Ethereal.

"It's not a community we specifically cater to, but it's a community that finds the tool to be useful," Combs said. "I don't know that it was a surprise that the security industry latched on to it. But it has been interesting seeing how that developed."

The two thought there would be a business for this market, so they set off to start CACE Technologies (since purchased by Riverbed) to manage Wireshark and related technologies. Combs' prior employer held the trademark for Ethereal, so the duo forked the technology, renaming it Wireshark.

Today, the software is being used across a wide range of industries, each with its own set of oddball protocols and network traffic patterns to be grappled with.  When Degioanni launched Sysdig in 2013, it immediately put Wireshark to use in helping parse log data in real time from the cloud providers, Degioanni said.

At its heart, Ethereal had a powerful dissection engine. You could feed it "these blobs of data and it will break them down and tear them apart and show you all the various bits and bytes needed to its best ability," Combs said. "And this also lets you apply filters and apply all these other powerful features. But the thing is that the engine doesn't really care if it's packet data, it can be any sort of data you want."

Currently, for instance, Combs is looking to extend it to non-IP sources of data such as Bluetooth and USB devices.

New Foundation

Beyond its massive usefulness, Wireshark has also played a role in educating generations of programmers and administrators on how a network works. Just looking at the GUI as it decodes packets off the wire, you can get a sense of how the Internet actually works.

"I think it's important to educate people about low-level analysis, whether it's in packets, system events or system calls," Combs said. "I think that's very important knowledge to pass on and to educate people on."

Education will be one of the chief missions of the new Wireshark Foundation, which will provide a formal support structure for Wireshark, Combs said. Today, Wireshark's chief income comes through its conferences; with the foundation, the project will be able to accept contributions directly.

It will also provide some much-needed relief to Combs.  To date, Combs has been the chief maintainer, or the “benign dictator,” so to speak. The foundation will shift the structure to something more resembling a benign democracy.

“You can tell that all of us are starting to get a little bit of gray hair. And it’s pretty clear at this point, that Wireshark is big enough, relevant enough for the whole planet, that it is going to survive us,” Degioanni said.

This Week in Computing: Malware Gone Wild https://thenewstack.io/this-week-in-computing-malware-gone-wild/ Sat, 25 Mar 2023 14:10:18 +0000 https://thenewstack.io/?p=22703513

Malware is sneaky AF. It tries to hide itself and cover up its actions. It detects when it is being studied in a virtual sandbox, and so it sits still to evade detection. But when it senses a less secure environment — such as an unpatched Windows 7 box — it goes wild, as if possessing a split personality.

In other words, malware can no longer be fully understood simply by studying it in a lab setting, asserted University of Maryland associate professor Tudor Dumitras in a recently posted talk from USENIX's last-ever Enigma security and privacy conference.

Today, most malware is analyzed by examining execution traces that the malicious program generates ("dynamic malware analysis"). This is usually done in a controlled environment, such as a sandbox or virtual machine. Such analysis creates the signatures used to describe the behavior of the malicious software.

The malware community, of course, has been long hip to this scrutiny, and has developed an evasion technique known as red pills, which helps malware detect when it is in a controlled environment, and change its behavior accordingly.

As a result, many of the signatures used for commercial malware detection packages may not be able to adequately identify malware in all circumstances, depending on what traces the signature actually captured.

What we really need, Dumitras said, is execution traces from the wild. Dumitras led a study that collected info on real-world attacks, consisting of over 7.6 million traces from 5.4 million users.

“Sandbox traces can not account for the range of behaviors encountered in the wild.”

They found that, as Dumitras expected, traces collected in a sandbox rarely capture the full behavior of malware in the wild.

In the case of the WannaCry ransomware attack, for instance, sandbox tracing caught only 18% of all the actions that the ransomware executed in the wild.
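To make that figure concrete, coverage here is simply the share of distinct in-the-wild actions that also show up in the sandbox trace. The toy sketch below illustrates the calculation; the action names are invented, and the real study works on millions of execution traces rather than hand-written sets.

# Toy illustration of trace coverage: what fraction of the behaviors observed
# in the wild also appeared in the sandbox trace? Action names are invented.
sandbox_trace = {"read_registry", "spawn_process", "encrypt_files"}
wild_trace = {"read_registry", "spawn_process", "encrypt_files",
              "scan_smb_shares", "delete_shadow_copies", "spread_laterally",
              "contact_kill_switch_domain", "drop_ransom_note"}

coverage = len(sandbox_trace & wild_trace) / len(wild_trace)
print(f"sandbox coverage of in-the-wild behavior: {coverage:.0%}")  # 38% in this toy example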

For the keepers of malware detection engines, Dumitras advised using traces from multiple executions in the wild. He advised using three separate traces, as diminishing returns set in after that.


Reporter’s Notebook

“So far, having an AI CEO hasn’t had any catastrophic consequences for NetDragon Websoft. In fact, since Yu’s appointment, the company has outperformed Hong Kong’s stock market.” — The Hustle, on replacing CEOs with AI Chatbots.

AI “Latent space embeddings end up being a double-edged sword. They allow the model to efficiently encode and use a large amount of data, but they also cause possible problems where the AI will spit out related but wrong information.” — Geek Culture, on why ChatGPT lies.

“We think someone who writes for a living needs to constantly be thinking about the best way to express complex ideas in their own words.” ⁦– Wired, on its editorial use of generative AI.

“I think with Kubernetes, we did a decent job on the backend. But we did not get developers, not one little bit. That was a missed opportunity to really bring the worlds together in a natural way” — Kubernetes co-founder Craig McLuckie, on how the operations-centric Kubernetes perplexed developers (See: YAML), speaking at a Docker press roundtable this week.

McLuckie also noted that 60% of machine learning workloads now run on Kubernetes.

“After listening to feedback and consulting our community, it’s clear that we made the wrong decision in sunsetting our Free Team plan. Last week we felt our communications were terrible but our policy was sound. It’s now clear that both the communications and the policy were wrong, so we’re reversing course and no longer sunsetting the Free Team plan” —Docker, responding to the outcry in the open source community over the suspension of its free Docker Hub tier for teams.

“Decorators are by far the biggest new feature, making it possible to decorate classes and their members to make them more easily reusable. […] Decorators are just syntactic glue aiming to simplify the definition of higher-order functions” — Software Engineer Sergio De Simone on the release of TypeScript 5.0, in InfoQ.

“If these details cannot be hidden from you, and you need to build a large knowledge base around stuff that does not directly contribute to implementing your program, then choose another platform.” — Hacker News commenter, on the needless complexity that came with using Microsoft Foundation Classes (MFC) for C++ coding.

Now 25 years old, the venerable Unix curl utility can enjoy an adult beverage in New Delhi.

Ken Thompson “has a long and storied history of trolling the computer industry […] he revealed, during his Turing Award lecture, that he had planted an essentially untraceable back door in the original C compiler… and it was still there.” — Liam Proven, The Register.

“It’s just like planning a dinner. You have to plan ahead and schedule everything so it’s ready when you need it.” —  Grace Hopper, 1967, explaining programming to the female audience of Cosmopolitan.

JWTs: Connecting the Dots: Why, When and How https://thenewstack.io/jwts-connecting-the-dots-why-when-and-how/ Mon, 20 Mar 2023 14:10:34 +0000 https://thenewstack.io/?p=22702931

JSON web tokens (JWTs) are great — they are easy to work with and stateless, requiring less communication with a centralized authentication server. JWTs are handy when you need to securely pass information between services. As such, they’re often used as ID tokens or access tokens.

This is generally considered a secure practice as the tokens are usually signed and encrypted. However, when incorrectly configured or misused, JWTs can lead to broken object-level authorization or broken function-level authorization vulnerabilities. These vulnerabilities can expose a state where users can access other data or endpoints beyond their privileges. Therefore, it’s vital to follow best practices for using JWTs.

Knowing and understanding the fundamentals of JWTs is essential when determining a behavior strategy.

What Are JWTs?

JWT is a standard defined in RFC 7519, and its primary purpose is to pass a JSON message between two parties in a compact, URL-safe and tamper-proof way. The token looks like a long string divided into sections and separated by dots. Its structure depends on whether the token is signed (JWS) or encrypted (JWE).

JWS structure: three base64url-encoded sections separated by dots, containing the header, the payload and the signature.

JWE structure: five base64url-encoded sections separated by dots, containing the header, the encrypted key, the initialization vector, the ciphertext and the authentication tag.
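A quick way to see that structure is to split a token on the dots and base64url-decode the first two sections of a JWS; note that nothing decoded this way should be trusted until the signature has been verified. A minimal sketch using only the standard library:

# Decode (but do NOT trust) the header and payload of a compact-serialized JWS.
# The signature must still be verified before acting on any of these claims.
import base64
import json

def b64url_decode(segment: str) -> bytes:
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))

def peek_jws(token: str):
    header_b64, payload_b64, _signature_b64 = token.split(".")
    header = json.loads(b64url_decode(header_b64))
    payload = json.loads(b64url_decode(payload_b64))
    return header, payload

# Usage with some token string you hold:
#   header, payload = peek_jws(token)
#   print(header["alg"], payload.get("iss"))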

Are JWTs Secure? 

The short answer is that it depends. The security of JWTs is not a given. As mentioned above, JWTs are often considered secure because they are signed or encrypted, but their security really depends on how they are used. A JWT is a message format in which structure and security measures are defined by the RFC, but it is up to you to ensure their use does not harm the safety of your whole system in any way.

When to Use JWTs

Should they be used as access and ID tokens?

JWTs are commonly used as access tokens and ID tokens in OAuth and OpenID Connect flows. They can also serve different purposes, such as transmitting information, requesting objects in OpenID Connect, authenticating applications, authorizing operations and other generic use cases.

Some say that using JWTs as access tokens is an unwise decision. However, in my opinion, there is nothing wrong if developers choose this strategy based on well-done research with a clear understanding of what JWTs essentially are. The worst-case scenario, on the other hand, is to start using JWTs just because they are trendy. There is no such thing as too many details when it comes to security, so following the best practices and understanding the peculiarities of JWTs is essential.

JWTs are by-value tokens containing data intended for the API developers so that APIs can decode and validate the token. However, if JWTs are issued to be used as access tokens to your clients, there is a risk that client developers will also access this data. You should be aware that this may lead to accidental data leaks since some claims from the token should not be made public. There is also a risk of breaking third-party integrations that rely on the contents of your tokens.

Therefore, it is recommended to:

  • Remember that introducing changes into JWTs used as access tokens may cause problems with app integrations.
  • Consider switching to Phantom tokens or Split tokens when sensitive or personal information is used in a token. In these cases, an opaque token should be used outside your infrastructure.
  • When a high level of security is required, use Proof-of-Possession tokens instead of Bearer tokens by adding a confirmation claim to mitigate the risks of unwanted access.

Should they be used to handle sessions?

An example of improper use of JWTs is choosing them as a session-retention mechanism and replacing session cookies and centralized sessions with JWTs. One of the reasons you should avoid this tactic is that JWTs cannot be invalidated, meaning you won’t be able to revoke old or malicious sessions. Size issues pose another problem, as JWTs can take up a lot of space. Thus, storing them in cookies can quickly exceed size limits. Solving this problem might involve storing them elsewhere, like in local storage, but that will leave you vulnerable to cross-site scripting attacks.

JWTs were never intended to handle sessions, so I recommend avoiding this practice.

Claims Used in JWTs and How to Handle Them

JWTs use claims to deliver information. Properly using those claims is essential for security and functionality. Here are some basics on how to deal with them.

Claims, their functions and the best practices for handling them:

iss: Shows the issuer of the token.
  • Always check against an allowlist to ensure it has been issued by someone you expect to issue it.
  • The value of the claim should exactly match the value you expect it to be.

sub: Indicates the user or other subject of the token.
  • As anyone can decode the token and access the data, avoid using sensitive data or PII.

aud: Indicates the receiver of the token.
  • Always verify that the token is issued to an audience you expect.
  • Reject any request intended for a different audience.

exp: Indicates the expiration time for the token.
  • Use a short expiration time — minutes or hours at maximum.
  • Remember that server times can differ slightly between different machines.
  • Consider allowing a clock skew when checking the time-based values.
  • Don't use more than 30 seconds of clock skew.
  • Use iat to reject tokens that haven't expired but which, for security reasons, you deem to be issued too far in the past.

nbf: Identifies the time before which the JWT must not be accepted for processing.

iat: Identifies the time at which the JWT was issued.

jti: Provides a unique identifier for the token.
  • It must be assigned in a way that prevents the same value from being used with a different object.

Validating Tokens

It is important to remember that incoming JWTs should always be validated. It doesn’t matter if you only work on an internal network (with the authorization server, the client and the resource server not connected through the internet). Environment settings can be changed, and if services become public, your system can quickly become vulnerable. Implementing token validation can also protect your system if a malicious actor is working from the inside.

When validating JWTs, always make sure they are used as intended (a minimal validation sketch follows the list below):

  • Check the scope of the token.
  • Don’t trust all the claims. Verify whether keys contained in claims, or any URIs, correspond to the token’s issuer and contain a value that you expect.
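Here is that sketch using the third-party PyJWT library; the issuer, audience, key and algorithm choice are placeholders, and your authorization server's documentation remains authoritative.

# Hedged validation sketch with PyJWT: pin the algorithm allowlist, require the
# critical claims, and check issuer and audience. The values are placeholders.
import jwt  # PyJWT

EXPECTED_ISSUER = "https://idp.example.com"     # placeholder
EXPECTED_AUDIENCE = "https://api.example.com"   # placeholder

def validate_access_token(token: str, public_key: str) -> dict:
    return jwt.decode(
        token,
        key=public_key,
        algorithms=["ES256"],          # allowlist; never taken from the token itself
        issuer=EXPECTED_ISSUER,
        audience=EXPECTED_AUDIENCE,
        leeway=30,                     # small clock-skew allowance, in seconds
        options={"require": ["exp", "iss", "aud", "sub"]},
    )

# jwt.decode raises an exception (ExpiredSignatureError, InvalidAudienceError,
# InvalidSignatureError and so on) if any check fails; only then trust the claims.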

Best Algorithms to Use with JWTs

The registry for JSON Web Signatures and Encryption Algorithms lists all available algorithms that can be used to sign or encrypt JWTs. It is also very useful to help you choose which algorithms should be implemented by clients and servers.

Currently, the most recommended algorithms for signing are EdDSA and ES256. They are preferred over RS256, the most popular option, because they are much faster.

No matter the token type — JWS or JWE — they contain an alg claim in the header. This claim indicates which algorithm has been used for signing or encryption. This claim should always be checked against an allowlist of algorithms accepted by your system. Allowlisting helps to mitigate attacks that attempt to tamper with tokens (these attacks may try to force the system to use different, less secure algorithms to verify the signature or decrypt the token). It is also more efficient than denylisting, as it prevents issues with case sensitivity.

How to Sign JWTs

One thing to remember about JWS signatures is that they are used to sign both the payload and the token header. Therefore, if you make changes to either the header or the payload, whether merely adding or removing spaces or line breaks, your signature will no longer validate.

My recommendations when signing JWTs are the following:

  • To avoid duplicating tokens, add a random token ID in the jti claim. Many authorization servers provide this opportunity.
  • Validate signatures, keys and certificates. Keys and certificates can be obtained from the authorization server. A good practice is to use an endpoint and download them dynamically. This makes it easy to rotate keys in a way that would not break the implementation.
  • Check the keys and certificates sent in the header of the JWS against an allowlist, or validate the trust chain for certificates.

Symmetric keys are not recommended for use in signing JWTs. Using symmetric signing presupposes that all parties need to know the shared secret. As the number of involved parties grows, it becomes more difficult to guard the safety of the secret and replace it if it is compromised.

Another problem with symmetric signing is that you don’t know who actually signed the token. When using asymmetric keys, you’re sure that the JWT was signed by whoever possesses the private key. In the case of symmetric signing, any party with access to the secret can also issue signed tokens. Always choose asymmetric signing. This way, you’ll know who actually signed the JWT and make security management easier.
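To illustrate the asymmetric approach, here is a short sketch with the third-party cryptography and PyJWT packages; the issuer, audience and subject are placeholders, and in a real deployment the authorization server, not application code, would generate, store and rotate the keys.

# Sketch of asymmetric (ES256) signing: only the holder of the private key can
# issue tokens, while verifiers need only the public key. Claims are examples.
import datetime
import uuid

import jwt  # PyJWT
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric import ec

private_key = ec.generate_private_key(ec.SECP256R1())
private_pem = private_key.private_bytes(
    serialization.Encoding.PEM,
    serialization.PrivateFormat.PKCS8,
    serialization.NoEncryption(),
)
public_pem = private_key.public_key().public_bytes(
    serialization.Encoding.PEM,
    serialization.PublicFormat.SubjectPublicKeyInfo,
)

now = datetime.datetime.now(datetime.timezone.utc)
claims = {
    "iss": "https://idp.example.com",             # placeholder issuer
    "aud": "https://api.example.com",             # placeholder audience
    "sub": "user-123",                            # placeholder subject
    "iat": now,
    "exp": now + datetime.timedelta(minutes=10),  # short-lived, per the advice above
    "jti": str(uuid.uuid4()),                     # random token ID
}

token = jwt.encode(claims, private_pem, algorithm="ES256")

# Anyone holding only the public key can verify the token:
decoded = jwt.decode(token, public_pem, algorithms=["ES256"],
                     issuer="https://idp.example.com", audience="https://api.example.com")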

JWTs and API Security

API security has become one of the main focuses of cybersecurity efforts. Unfortunately, vulnerabilities have increased as APIs have become critical for overall functionality. One of the ways to mitigate the risks is to ensure that JWTs are used correctly. JWTs should be populated with scopes and claims that correspond well to the client, user, authentication method used and other factors.

Conclusion

JWTs are a great technology that can save developers time and effort and ensure the security of APIs and systems. To fully reap their benefits, however, you must ensure that choosing JWTs fits your particular needs and use case. Moreover, it is essential to make sure they are used correctly. To do this, follow the best practices from security experts.

Palo Alto Networks Adds AI to Automate SASE Admin Operations https://thenewstack.io/palo-alto-networks-adds-ai-to-automate-sase-admin-operations/ Fri, 17 Mar 2023 13:00:10 +0000 https://thenewstack.io/?p=22702821

Whether one pronounces SASE as "sassy" or "sayce," the secure access service edge is a piece of IT that is fast becoming central to enterprise systems as increasing amounts of data come into them from a multiplicity of channels. SASE is used to distribute wide-area network and security controls as a cloud service directly to the source of connection at the edge of the network rather than to a data center.

As its contribution to managing this tangle of virtual wires, Palo Alto Networks this week revealed new capabilities to update its Prisma SASE platform by — you guessed it — adding AIOps to automate these operations and make them more efficient. The company describes this as the industry’s first “natively integrated Artificial Intelligence for IT Operations” for SASE, because it brings together what normally are best-of-breed components (SDN, zero trust security, software-defined secure web gateway) into the same centralized package.

New Features

The new features enable organizations to automate their increasingly complex IT and network operations center (NOC) functions, Palo Alto Networks VP of SASE Marketing Matt De Vincentes told The New Stack.

“You can mix and match these components from multiple different vendors, and you get a potential stack when you have these capabilities kind of integrated together,” De Vincentes said. “But increasingly, we’re seeing a movement toward what we call single-vendor SASE, which is all of these capabilities brought together by a single thing that you can simplify. That’s exactly what we’re doing.

“So all of the capabilities that a customer would need to build out this SASE deployment they can get through a single (SaaS) service. Then on top of that, with one vendor you can bring all the data together into one single data lake — and do some interesting AI on top of that.”

AIOps

Palo Alto Networks calls this Autonomous Digital Experience Management (ADEM), which also provides users end-to-end observability across their network, De Vincentes said. Since ADEM is integrated within Prisma SASE, it does not require additional appliances or agents to be deployed, De Vincentes said.

According to De Vincentes, AIOps for ADEM:

  • proactively remediates issues that can cause service interruption through AI-based problem detection and predictive analytics;
  • isolates issues faster (reduced mean time to repair) through an easy-to-use query interface; and
  • discovers network anomalies from a single dashboard.

Palo Alto Networks also announced three new SD-WAN (software-defined wide-area network) features for users to secure IoT devices, automate branch management, and manage their SD-WAN via on-premises controllers. Capabilities, according to the company, include:

  • Prisma SD-WAN Command Center provides AI-powered and segment-wise insights and always-on monitoring for network and apps for proactive problem resolution at the branch level.
  • Prisma SD-WAN with integrated IoT security enables existing Prisma SD-WAN appliances to help secure IoT devices. This enables accurate detection and identification of branch IoT devices.
  • On-Prem Controller for Prisma SD-WAN helps meet customer regulatory and compliance requirements and works with on-prem and cloud controller deployments.

Users can now elect to deploy Prisma SD-WAN using the cloud-management console, on-prem controllers, or both in a hybrid scenario, the company said.

All new capabilities will be available by May 2023, except the Prisma SD-WAN Command Center, which will be available by July, the company said.

TrueNAS SCALE Network Attached Storage Meets High Demand https://thenewstack.io/truenas-scale-network-attached-storage-meets-high-demand/ Thu, 02 Mar 2023 15:13:27 +0000 https://thenewstack.io/?p=22701589

TrueNAS SCALE might not be a distribution on the radar of most cloud native developers, but it should be. Although TrueNAS SCALE is, by design, a network-attached storage solution (based on Debian), it is also possible to create integrated virtual machines and even Linux containers.

TrueNAS SCALE can be deployed as a single node or even to a cluster. It can be expanded with third-party applications, offers snapshotting, and can be deployed on off-the-shelf hardware or as a virtual machine.

iXsystems' TrueNAS SCALE is built on TrueNAS CORE, is designed for hybrid clouds, and will soon offer enterprise support options. The operating system is powered by OpenZFS and Gluster for scalable ZFS features and data management.

You’ll find support for KVM virtual machines, Kubernetes, and Docker.

Even better, TrueNAS SCALE is open source and free to use.

Latest Release

Recently, the company launched TrueNAS SCALE 22.12.1 (Bluefin), which includes numerous improvements and bug fixes. The list of improvements to the latest release includes the following:

  • SMB Share Proxy to provide a redirect mechanism for SMB shares in a common namespace.
  • Improvements to rootless login.
  • Fixes to ZFS HotPlug.
  • Improved Dashboard for both Enterprise HA and Enclosure management.
  • Improved Host Path Validation for SCALE applications.
  • Support for external share paths added.

There have also been a number of new features added to the latest release, including the following:

  • SSH Key Upload to simplify and better secure remote access for users.
  • DFS Proxy Share
  • Kubernetes Pass-Through enables external access to the Kubernetes API within a node.
  • Improved first UI login (when root password has not been set).
  • Allow users to create and manage ACL presets.
  • Sudo fields to provide correct privileges for remote targets.

Read the entire changelog to find out all of the improvements and new features that were added to TrueNAS SCALE.

Up-Front Work

One thing to keep in mind when considering TrueNAS SCALE is that there is a bit of up-front work you must do to make it work. Upon installation of the OS, you’ll have to create storage pools, users, shares, and more. There is a bit of a learning curve with this NAS solution, but the end result is very much worth the time you’ll spend making it work.

As far as the web UI is concerned, you’ll find it to be incredibly well-designed (Figure 1).

Figure 1: The default TrueNAS SCALE web UI is a thing of beauty.

In order to use the Virtualization feature, your CPU must support KVM extensions. This can be problematic when using TrueNAS as a virtual machine (with the likes of VirtualBox). To make this work, you must enable Nested Virtualization. Here’s how you do that.

First, create the virtual machine for TrueNAS. Once the VM is created, you’ll need to find the .vbox file in the TrueNAS VirtualBox folder. Open that file for editing (in my example, the file is TRUENAS.vbox). Look for the following section:

<CPU count="2">
  <PAE enabled="false"/>
  <LongMode enabled="true"/>
  <X2APIC enabled="true"/>
  <HardwareVirtExLargePages enabled="true"/>
</CPU>


Add the following line to that section:

<NestedHWVirt enabled="true"/>


The new section should look like this:

<CPU count="2">
  <PAE enabled="false"/>
  <LongMode enabled="true"/>
  <X2APIC enabled="true"/>
  <HardwareVirtExLargePages enabled="true"/>
  <NestedHWVirt enabled="true"/>
</CPU>

The GUI Method

If you prefer the GUI method, open the Settings for the VM, go to System, and click the checkbox for Enable Nested VT-x/AMD-V, and click OK. Start the VM and Virtualization should now work. You’ll know if it’s working if you click on the Virtualization section and you see Add Virtual Machine (Figure 2).

Figure 2: Virtualization is now enabled for our TrueNAS VM.

In a soon-to-be-written tutorial, I will show you how to start working with containers via TrueNAS. Until then, I highly recommend you download an ISO image of this incredible NAS solution, install it, create your pools/users/shares, and start enjoying the ability to share files and folders to your network.

How Secure Is Your API Gateway? https://thenewstack.io/how-secure-is-your-api-gateway/ Tue, 28 Feb 2023 13:16:57 +0000 https://thenewstack.io/?p=22701351

Quick, how many APIs does your organization use? We’re talking for internal products, for external services and even for infrastructure management such as Amazon’s S3 object storage or Kubernetes. If you don’t know the answer, you are hardly alone. In survey after survey, CIOs and CISOs admit they don’t have an accurate catalog of all their APIs. Yet the march toward greater use of APIs is inevitable, driven by continued adoption of API-centric technology paradigms like cloud native computing and microservices.

According to statistics shared by Mark O’Neill, chief of research for software engineering at Gartner, in 2022:

  • 98% of organizations use or are planning to use internal APIs, up from 88% in 2019
  • 94% of organizations use or are planning to use public APIs provided by third parties, up from 52% in 2019
  • 90% of organizations use or are planning to use private APIs provided by partners, up from 68% in 2019
  • 80% of organizations provide or are planning to provide publicly exposed APIs, up from 46% in 2019

API Gateways Remain Critical Infrastructure Components

To deal with this rapid growth and the management and security challenges it creates, CIOs, Platform Ops teams, and cloud architects are turning to API gateways to centrally manage API traffic. API gateways help discover, manage, observe and secure API traffic on a network.

In truth, API gateways are a function that can be performed by either a reverse proxy or load balancer, and increasingly, an ingress controller. We know this for a fact because many NGINX open source users configure their NGINX instances specifically to manage API traffic.

This requires considerable customization, however, so it’s not surprising that many DevOps teams instead choose to deploy an API gateway that is already configured to handle some of the most important use cases for API management, like NGINX Plus.

API gateways improve security by acting as a central point of control and access for external applications accessing APIs. They can enforce authentication and authorization policies, as well as implement rate limiting and other security measures to protect against malicious attacks and unauthorized access.

Additionally, API gateways can encrypt data in transit and provide visibility and monitoring capabilities to help identify and prevent security breaches. API gateways can also prioritize traffic, enforce service-level agreements (SLAs) or business decisions around API usage, and conserve network and compute resources.

Once installed and fully deployed, API gateways tend to be sticky and hard to remove. So ensuring that you pick the right API gateway the first time is imperative. The stakes are high. Not all API gateways offer the same level of security, latency, observability and flexibility.

Some rely on underlying open source technologies that can cause security vulnerabilities or difficulties with reliability. Others may require cumbersome integration steps and generate unforeseen traffic latencies. All of these can affect the security of your API gateway and need to be considered during the selection process.

What’s under the Hood Matters — A Lot

The majority of API gateway solutions on the market are built atop modified versions of open source software. NGINX, HAProxy, Envoy and Traefik are all commonly used. However, many API gateway solutions are closed source (they use open source wrapped in proprietary code). That said, such proprietary solutions are still completely dependent on the underlying security of the open source components.

This can create significant security gaps. When a vulnerability is announced in an open source project underlying a proprietary API gateway solution, it may take months for the gateway vendor to push a security patch because any changes to the reverse proxy layer require regression testing and other quality assurance measures to ensure the fix does not affect stability or performance. Attackers know this and often look to target the exposed and unpatched open source layers in these products.

The bottom line? You need to know which technologies are part of your API gateway. Dependencies on third parties for modules and foundational components, either open source or proprietary, can generate unacceptable risks if you require a highly secure solution for your APIs.

Audit Your Security Dependencies with a Software Bill of Materials

Creating a software bill of materials (SBOM) is one of the most common ways to assess potential vulnerabilities. Simply put, an SBOM is a detailed inventory of all the software components, commercial and open source, that make up an application. To learn more about SBOMs, read “Create a Software Bill of Materials for Your Operating System.”
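As a quick sketch of what generating one can look like, assuming you use the open source Syft and Grype tools from Anchore (other SBOM generators work similarly, and the image name here is only an example), the first command produces an SPDX-format SBOM for a container image and the second scans that SBOM for known CVEs:

syft nginx:latest -o spdx-json > sbom.json
grype sbom:./sbom.json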

Once you have a full picture of your software stack, you can assess whether all your items meet your security and compliance standards. You’ll often find that many tools have embedded dependencies within them. Some projects are actively maintained and release patches for known CVEs (common vulnerabilities and exposures) with a standardized service-level agreement.

But many major open source projects are not commercial entities, so they might not issue SLAs on vulnerability disclosure or guaranteed patch times, leaving you more vulnerable to CVEs. That, in turn, can unintentionally put your services out of compliance with the required standards. For those reasons, you need to verify whether each individual component in your SBOM can be made compliant.

You can read “How to Prepare Your Apps for Regulated Markets” for more information about auditing your software technology stack.

Easy Integration with Other Security Controls Is Critical

While API gateways are a critical part of API security, they are only one element. Most organizations running API gateways also need a web application firewall (WAF) in front of their gateway to block attacks (OWASP API Security Top 10 and others). If their infrastructure is distributed, they need more than one WAF. In larger enterprises, the API gateway needs to integrate with a global firewall that polices all traffic going in or out.

Even newer API security solutions that help with challenges like API discovery and threat analysis depend on robust integration with an API gateway. These tools often rely on the API gateway for visibility into API traffic, and usually work with the API gateway to address any emerging threats.

In all cases, tight integration between the API gateway and security tools is critical for maintaining effective security. It’s most convenient if you can use a single monitoring solution to track both firewall and gateway traffic.

This can be a challenging integration, particularly if an organization is operating in a multicloud or hybrid environment. Integration challenges can also mean that changes to gateway configurations require updates to the WAF or global firewall, adding to team workloads or — worst case — slowing down application development teams that have to wait for their firewall or gateway configuration requests to be synced.

Policy Granularity Can Vary Widely across Environments

In theory, an API gateway can enforce the same policy no matter what environment it is operating in. The reality is very different if you have to build your API gateways from different mixtures of components in different environments.

For example, an API management solution might use one underlying open source technology for on-premises or hosted installations of its API gateway and another for cloud services. Policy granularity and all of the resulting security benefits can also be starkly limited by the underlying foundational reverse proxy itself, or the mismatch of capabilities between the two implementations.

For these reasons, it’s critical to run an extensive proof of concept (POC) that closely emulates live production traffic. It’s the only way to be sure the API gateway solution can provide the type of policy granularity and control you require for your growing API constellation.

Inadequate policy granularity and control can result in less agile security capabilities, often reducing the API gateway to a blunt instrument rather than the finely honed scalpel required for managing the rapidly shifting attack surface of your API landscape.

Speed Matters to Application Development Teams

How fast an API gateway can pass traffic safely while still enforcing policies is of critical importance to application teams and API owners. Slow APIs can affect overall performance of applications in compounding ways by forcing dependent processes to wait, generating a poor user experience.

Teams forced to deal with slow APIs are more likely to circumvent security systems or roll their own to improve performance and better control user experience and dependencies. This is the API equivalent of shadow IT, and it creates considerable security risks if APIs are not properly locked down, tested and monitored.

The API gateway alone must be fast. But it’s equally important to look at the latency hit generated by the combination of WAF and API gateway. Ideally, the two are tightly integrated, reducing the need to slow down packets. This is another reason why a near-production POC is crucial for making the right decision.

Conclusion: Your API Gateway Security Mileage Can Vary — Choose Wisely

APIs are the future of technology infrastructure and composable, loosely coupled applications. Their rapid proliferation is likely to accelerate as more and more organizations move to the cloud, microservices and other decoupled and distributed computing paradigms.

Even if you are going against the tide and moving in the opposite direction to monoliths, your applications still need to manage APIs to communicate with the rest of the world, including partners, customers, storage layers, payment providers like Stripe, critical cloud services like CDNs and more.

An API gateway is a serious purchase requiring careful consideration. The most important consideration of all, naturally, is security.

The four criteria we laid out in this post — reliable underlying technology, easy integration with security tools, policy granularity across environments and low latency — are just a few of the many boxes an API gateway needs to check before you put it into production.

Choose wisely, think deeply, and may the API force be with you!

The post How Secure Is Your API Gateway? appeared first on The New Stack.

]]>
Bullet-Proofing Your 5G Security Plan https://thenewstack.io/bullet-proofing-your-5g-security-plan/ Fri, 24 Feb 2023 15:24:00 +0000 https://thenewstack.io/?p=22700996

With latency improvements and higher data speeds, 5G represents exponential growth opportunities with the potential to transform entire industries —

The post Bullet-Proofing Your 5G Security Plan appeared first on The New Stack.

]]>

With latency improvements and higher data speeds, 5G represents exponential growth opportunities with the potential to transform entire industries — from fueling connected autonomous vehicles, smart cities, mixed reality technologies, robotics and more.

As enterprises rethink connectivity, 5G will be a major investment area. However, according to Palo Alto Networks’ What’s Next in Cyber survey, while 88% of global executives say they understand the security challenges associated with 5G, only 21% of them have a plan to address such challenges.

As is true for any emerging technology, there will always be a level of uncertainty. However, with a few key considerations, executives across industries can foster confidence in their organization’s ability to handle 5G security challenges.

Outlining the Framework 

A comprehensive 5G security plan is built by first outlining the framework and identifying the key security principles that should inform every component of the plan. The goal, of course, is to navigate risks and secure your organization’s 5G network while also advancing digital transformation. Your framework should ultimately center around visibility, control, enforcement, dynamic threat correlation and life cycle.

Implementing visibility is key to having a complete understanding of the enterprise 5G network. Data logs, for instance, can capture data from multiple systems to better secure an entire environment. In terms of control, it’s beneficial to use cloud-delivered advanced threat protection to command and control traffic, as well as detect malware.

Adopting a zero trust model can ensure strong enforcement and consistent security visibility across the entire network, while dynamic threat correlation can help isolate infected devices. Last, by shedding light on usage patterns and potential security gaps, you can stay one step ahead of an evolving threat landscape.

Embracing AI and Automation 

With the expanded surface area of a 5G network spanning multiaccess edge computing (MEC), network slices and instantaneous service orchestration, there is much more room for potential threats. Coupled with the proliferation of user-owned devices and IoT, this highly distributed environment creates grounds for threats to evolve faster and do more damage.

Given this, automation plays an important role in building a secure 5G network. With advanced automation, organizations can alleviate the stress put on their cybersecurity teams to scan a multitude of areas for potential threats. As new services are configured and added to the 5G network, automation also helps to quickly scan and serves as a repeatable approach to deploying security.

Additionally, as threat actors leverage AI to automate attacks, similar technology is needed at the organizational level to best defend. With the complexity of 5G deployments, an AI-powered approach can intelligently stop attacks and threats while also providing granular application identification policies to protect against advanced threats regardless of their origin.

Adopting Zero Trust 

Zero trust for 5G means removing implicit trust and continuously monitoring and approving each stage of digital interaction. This means that regardless of what the situation is, who the user is or what application they are trying to gain access to, each interaction has to be validated. On a network security level, zero trust specifically protects sensitive data and critical applications.

More specifically, zero trust leverages network segmentation, provides Layer 7 threat prevention, prevents lateral movement and simplifies granular user-access control. Whereas a traditional security model operates under the assumption that everything within the organization’s purview can be trusted, this model understands that trust is a vulnerability. Ultimately, zero trust provides an opportunity for your organization to rethink security and keep up with digital transformation.

5G represents a paradigm shift and has the potential to expand connectivity. As your organization embarks on its own journey toward a 5G future, security cannot be an afterthought. Building a strong 5G security plan must start from the ground up as new, sophisticated cyberattacks are always looming. However, by building an informed framework, leveraging AI and automation, and implementing a zero trust framework, your organization will enjoy the innovation, reliability and performance that 5G has to offer.

The post Bullet-Proofing Your 5G Security Plan appeared first on The New Stack.

]]>
What David Flanagan Learned Fixing Kubernetes Clusters https://thenewstack.io/what-david-flanagan-learned-fixing-kubernetes-clusters/ Fri, 17 Feb 2023 18:54:39 +0000 https://thenewstack.io/?p=22700777

People are mean. That’s one of the first things David Flanagan learned by fixing 50+ deliberately broken Kubernetes clusters on

The post What David Flanagan Learned Fixing Kubernetes Clusters appeared first on The New Stack.

]]>

People are mean. That’s one of the first things David Flanagan learned by fixing 50+ deliberately broken Kubernetes clusters on his YouTube series, “Klustered.”

In one case, the submitter substituted a ‘c’ character with a Unicode doppelganger — it looked identical to a c in the terminal output — thus causing an error that led to Flanagan doubting himself and his ability to fix clusters.

“I really hate that guy,” Flanagan confided at the Civo Navigate conference last week in Tampa. “That was a long episode, nearly two hours we spent trying to fix this. And what I love about that clip — because I promise you, I’m quite smart and I’m quite good with Kubernetes — but it had me doubting things that I know are not the fault. The fact that I thought a six digit number is going to cause any sort of overflow on a 64 bit system — of course not. But debugging is hard.”

After that show, Klustered adopted a policy of no Unicode breaks.

“You only learn when things go wrong,” Flanagan said. “This is why I really love doing Klustered. If you just have a cluster that just works, you’re never really going to learn how to operate that beyond a certain level of scale. And Klustered brings us a situation where we can have people bring their failures from their own companies, their own organizations, their own teams, and we replicate those issues on a live stream format, but it allows us to see how individuals debug it as well.”

Linux Problems

Debugging is hard, he said, even when you have a team from Red Hat working to resolve the problem, as he learned during another episode featuring teams from Red Hat and Talos. In that situation, Red Hat had removed the executable bit from important binaries such as kubectl, kubeadm, and even Perl (which can execute most syscalls on a machine), limiting the Talos team’s ability to fix the fault.

“What we learned from this episode is you can actually execute the dynamic linker on Linux. So we have this ld-linux.so you can actually execute any binary on a machine, proxying it through that linker. So you can chmod /bin/chmod itself, like so, which is a really cool trick.”

/lib/ld-linux.so /bin/chmod +x /bin/chmod

People have also modified attributes on a Linux file system.

“Anyone know what attributes are in a Linux file system?” he asked. “No, of course not. Why should you?”

But these attributes allow you to get really low-level access to the file system. He showed how they marked a file as immutable.

“So you can pick a file that you know kubectl or Kubernetes has to write to and mark it as immutable, and you’ve immediately broken the system,” he said. “You’re not going to be able to detect that break by running your regular ls commands, you actually do need to do an lsattr on the file, and understand what these obscure references mean when you list them all. So, again, Klustered just gives us an environment where we get to extract all of this knowledge from people that have done stuff that we haven’t done before.”
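As a rough illustration of the break and the diagnosis (the path here is only an example; any critical file works the same way):

chattr +i /etc/kubernetes/manifests/kube-apiserver.yaml   # mark the file immutable: the break
ls -l /etc/kubernetes/manifests/kube-apiserver.yaml       # output looks completely normal
lsattr /etc/kubernetes/manifests/kube-apiserver.yaml      # reveals the 'i' (immutable) attribute
chattr -i /etc/kubernetes/manifests/kube-apiserver.yaml   # remove the attribute: the fix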

On another episode, he had Kris Nóva, a kernel hacker who has worked in security and Kubernetes, along with Thomas Stromberg, a previous maintainer of minikube while at Google, who has also worked in forensic analysis of intrusions. Stromberg had to fix the cluster broken by Nóva, a security industry elite.

“Thomas came on and runs this fls command,” he said. “It’s a very old toolkit, written in the late ’90s, called Sleuth Kit, that does forensic analysis of Linux file systems.”

“By running this command, he got a time ordered change of every modification to the Linux file system. He had every answer to every question he wanted to answer for the last 48 hours…. So I love that we have these opportunities of complete serendipity to share knowledge with everyone,” he added.

Network Breaks Common

Networking breaks are fairly common on that show. Kubernetes has core networking policies in place to keep them from happening… but still, it happens.

“However, we’re now seeing fragmentation as other CNI providers bring on their own adaptations to network policies,” Flanagan relayed. “It’s not enough to check for network policies or cluster network policies. … What you need to know to successfully operate a Kubernetes cluster at the networking level continues to evolve and get very cumbersome, scary, complicated, but also easier.”

Flanagan’s biggest frustration with Kubernetes is the default DNS policy.

“Who thinks the default DNS policy in Kubernetes is the default DNS policy? Now we have this DNS policy called default,” he said. “But it’s not the default. The default is cluster first, which means it’s going to try and resolve the DNS name within the cluster. And the default policy actually resolves to the default routing on the host.”
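To make the distinction concrete, here is a minimal, hypothetical pod spec (the field names are standard Kubernetes; only the pod itself is made up). Omitting dnsPolicy gives you ClusterFirst, while setting it to Default makes the pod inherit the node’s resolver configuration:

apiVersion: v1
kind: Pod
metadata:
  name: dns-demo
spec:
  # "Default" is not the default: it resolves names using the node's DNS settings.
  # Leaving this field out gives you "ClusterFirst", which tries cluster DNS first.
  dnsPolicy: Default
  containers:
  - name: app
    image: busybox
    command: ["sleep", "3600"]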

Flanagan said he’s been discussing with people like Tim Hockin and other Kubernetes maintainers how the community can remove some of the anomalies that are out there essentially tripping up people who just haven’t encountered these problems before.

eBPF Changing the Landscape

eBPF is changing the landscape as well, he said. Rather than going into a Linux machine and running iptables -L, which he noted has been ingrained into developers’ skulls for the past 20 years, developers are now supposed to listen to all the eBPF probes and traffic policies. And essentially, you need other eBPF tools that can understand the existing eBPF tools.

He recommended checking out Hubble for a visual representation of older network policies — Kubernetes and Cilium specifically, he added. Hubble also ships with a CLI.

“We have the tools to understand networking within our cluster. If you’re lucky enough to be using Cilium, if you’re using other CNI, you will have to find other tools, but they do exist as well,” he said.

He also recommended Cilium Editor.

“You can build a Kubernetes networking policy, or a Cilium network policy by dragging boxes, changing labels and changing port numbers,” Flanagan said. “So you don’t actually need to learn how to navigate these esoteric YAML files anymore.”

Cilium Editor will allow you to use drag-and-drop to build out a Kubernetes networking policy, he said.

Other Learnings

There are other ways to break Kubernetes clusters, of course. You can attack the container runtime, he noted. People have rolled back the kubectl binary as many as 25 versions; 25 versions is what it took to actually break backwards compatibility so that it can’t speak to the API server. Storage is another consideration with your own CSI providers, he added.

He also recommended three resources:
  • Brendan Gregg’s book
  • BCC
  • Ebpfkit

What he’d like to normalize is engineers admitting what they don’t know and sharing knowledge.

“The one rule I give people is please don’t sit there quietly, Googling off camera to get an answer and go, Oh, I know how to fix this,” he said. “I’d love to get senior engineers to set better norms for the newcomers in our industry and remove the hero culture we’ve established over the last 30 years.”

Civo paid for Loraine Lawson’s travel and accommodations to attend the conference.

The post What David Flanagan Learned Fixing Kubernetes Clusters appeared first on The New Stack.

]]>
API Gateway, Ingress Controller or Service Mesh: When to Use What and Why https://thenewstack.io/api-gateway-ingress-controller-or-service-mesh-when-to-use-what-and-why/ Fri, 17 Feb 2023 14:40:32 +0000 https://thenewstack.io/?p=22700655

In just about every conversation on ingress controllers and service meshes, we hear some variation of the questions, “How is

The post API Gateway, Ingress Controller or Service Mesh: When to Use What and Why appeared first on The New Stack.

]]>

In just about every conversation on ingress controllers and service meshes, we hear some variation of the questions, “How is this tool different from an API gateway?” or “Do I need both an API gateway and an ingress controller (or service mesh) in Kubernetes?”

This confusion is understandable for two reasons:

  • Ingress controllers and service meshes can fulfill many API gateway use cases.
  • Some vendors position their API gateway tool as an alternative to using an ingress controller or service mesh — or they roll multiple capabilities into one tool.

Here, we will tackle how these tools differ and which to use for Kubernetes-specific API gateway use cases. For a deeper dive, including demos, watch the webinar “API Gateway Use Cases for Kubernetes.”

Definitions

At their cores, API gateways, ingress controllers and service meshes are each a type of proxy, designed to get traffic into and around your environments.

What Is an API Gateway?

An API gateway routes API requests from a client to the appropriate services. But a big misunderstanding about this simple definition is the idea that an API gateway is a unique piece of technology. It’s not. Rather, “API gateway” describes a set of use cases that can be implemented via different types of proxies, most commonly an ADC or load balancer and reverse proxy, and increasingly an ingress controller or service mesh. In fact, we often see users, from startup to enterprise, deploying out-of-the-box NGINX as an API gateway with reverse proxies, web servers or load balancers, and customizing configurations to meet their use case needs.

There isn’t a lot of agreement in the industry about what capabilities are “must haves” for a tool to serve as an API gateway. We typically see customers requiring the following abilities (grouped by use case):

Resilience Use Cases

  • A/B testing, canary deployments and blue-green deployments
  • Protocol transformation (between JSON and XML, for example)
  • Rate limiting
  • Service discovery

Traffic Management Use Cases

  • Method-based routing and matching
  • Request/response header and body manipulation
  • Request routing at Layer 7
  • Retries and keepalives

Security Use Cases

  • API schema enforcement
  • Client authentication and authorization
  • Custom responses
  • Fine-grained access control
  • TLS termination

Almost all these use cases are commonly used in Kubernetes. Protocol transformation and request/response header and body manipulation are less common since they’re generally tied to legacy APIs that aren’t well-suited for Kubernetes and microservices environments. They also tend to be indicative of monolithic applications that are less likely to run in Kubernetes.

What Is an Ingress Controller?

An ingress controller is a specialized Layer 4 and Layer 7 proxy that gets traffic into Kubernetes, to the services, and back out again (referred to as ingress-egress or north-south traffic). In addition to traffic management, ingress controllers can also be used for visibility and troubleshooting, security and identity, and all but the most advanced API gateway use cases.

What Is a Service Mesh?

A service mesh handles traffic flowing between Kubernetes services (referred to as service-to-service or east-west traffic). It is commonly used to achieve end-to-end encryption (E2EE) and for applying TLS to all traffic. A service mesh can be used as a distributed (lightweight) API gateway very close to the apps, made possible on the data plane level by service mesh sidecars.

Note: Choosing a service mesh is its own journey that is worth some consideration.

Use Kubernetes Native Tools for Kubernetes Environments

So how do you decide which tool is right for you? We’ll make it simple: If you need API gateway functionality inside Kubernetes, it’s usually best to choose a tool that can be configured using native Kubernetes config tooling such as YAML. Typically, that’s an ingress controller or service mesh. But we hear you saying, “My API gateway tool has so many more features than my ingress controller (or service mesh). Aren’t I missing out?” No! More features do not equal better tools, especially within Kubernetes where tool complexity can be a killer.

Note: “Kubernetes native” (not the same as Knative) refers to tools that were designed and built for Kubernetes. Typically, they work with the Kubernetes CLI, can be installed using Helm and integrate with Kubernetes features.

Most Kubernetes users prefer tools they can configure in a Kubernetes native way because that avoids changes to the development or GitOps experience. A YAML-friendly tool provides three major benefits:

  • YAML is a familiar language to Kubernetes teams, so the learning curve is low or even nonexistent if you’re using an existing Kubernetes tool for API gateway functionality. This helps your teams work within their existing skill set without the need to learn how to configure a new tool that they might only use occasionally.
  • You can automate a YAML-friendly tool in the same fashion as your other Kubernetes tools. Anything that cleanly fits into your workflows will be popular with your team, increasing the probability that they use it.
  • You can shrink your Kubernetes traffic-management tool stack by using Kubernetes native tools already in the stack. Every extra hop matters, and there’s no reason to add unnecessary latency or single points of failure. And of course, reducing the number of technologies deployed within Kubernetes is also good for your budget and overall security.

North-South API Gateway Use Cases: Use an Ingress Controller

Ingress controllers have the potential to enable many API gateway use cases. In addition to the ones outlined in Definitions, we find organizations most value an ingress controller that can implement:

  • Offload of authentication and authorization
  • Authorization-based routing
  • Layer 7 level routing and matching (HTTP, HTTP/S, headers, cookies, methods)
  • Protocol compatibility (HTTP, HTTP/2, WebSocket, gRPC)
  • Rate limiting

Sample Scenario: Method-Level Routing

You want to implement method-level matching and routing using the ingress controller to reject the POST method in API requests.

Some attackers look for vulnerabilities in APIs by sending request types that don’t comply with an API definition — for example, sending POST requests to an API that is defined to accept only GET requests. Web application firewalls (WAF) can’t detect these kinds of attacks. They examine only request strings and bodies for attacks, so it’s best practice to use an API gateway at the ingress layer to block bad requests.

As an example, suppose the new API /coffee/{coffee-store}/brand was just added to your cluster. The first step is to expose the API using an ingress controller simply by adding the API to the upstreams field.

apiVersion: k8s.nginx.org/v1
kind: VirtualServer
metadata:
  name: cafe
spec:
  host: cafe.example.com
  tls:
    secret: cafe-secret
  upstreams:
  - name: tea
    service: tea-svc
    port: 80
  - name: coffee
    service: coffee-svc
    port: 80


To enable method-level matching, you add a /coffee/{coffee-store}/brand path to the routes field and add two conditions that use the $request_method variable to distinguish between GET and POST requests. Any traffic using the HTTP GET method is passed automatically to the coffee service. Traffic using the POST method is directed to an error page with the message "You are rejected!" And just like that, you’ve protected the new API from unwanted POST traffic.

routes:
  - path: /coffee/{coffee-store}/brand
    matches:
    - conditions:
      - variable: $request_method
        value: POST
      action:
        return:
          code: 403
          type: text/plain
          body: "You are rejected!"
    - conditions:
      - variable: $request_method
        value: GET
      action:
        pass: coffee
  - path: /tea
    action:
      pass: tea


For more details on how you can use method-level routing and matching with error pages, check out these ingress controller docs. Additionally, you can dive into a security-related example of using an ingress controller for API gateway functionality.

East-West API Gateway Use Cases: Use a Service Mesh

A service mesh is not required, or even initially helpful, for most API gateway use cases because most of what you might want to accomplish can, and ought to, happen at the ingress layer. But as your architecture increases in complexity, you’re more likely to get value from using a service mesh. The use cases we find most beneficial are related to E2EE and traffic splitting, such as A/B testing, canary deployments and blue-green deployments.

Sample Scenario: Canary Deployment

You want to set up a canary deployment between services with conditional routing based on HTTP/S criteria.

The advantage is that you can gradually roll out API changes — such as new functions or versions — without affecting most of your production traffic.

Currently, your ingress controller routes traffic between two services managed by NGINX service mesh: Coffee.frontdoor.svc and Tea.frontdoor.svc. These services receive traffic from ingress controller and route it to the appropriate app functions, including Tea.cream1.svc. You decide to refactor Tea.cream1.svc, calling the new version Tea.cream2.svc. You want your beta testers to provide feedback on the new functionality, so you configure a canary traffic split based on the beta testers’ unique session cookie, ensuring your regular users only experience Tea.cream1.svc.

Using a service mesh, you begin by creating a traffic split between all services fronted by Tea.frontdoor.svc, including Tea.cream1.svc and Tea.cream2.svc. To enable the conditional routing, you create an HTTPRouteGroup resource (named tea-hrg) and associate it with the traffic split, the result being that only requests from your beta users (requests with the session cookie set to version=beta) are routed from Tea.frontdoor.svc to Tea.cream2.svc. Your regular users continue to experience only version 1 services behind Tea.frontdoor.svc.

apiVersion: split.smi-spec.io/v1alpha3
kind: TrafficSplit
metadata:
  name: tea-svc
spec:
  service: tea.1
  backends:
  - service: tea.1
    weight: 0
  - service: tea.2
    weight: 100
  matches:
  - kind: HTTPRouteGroup
    name: tea-hrg


apiVersion: specs.smi-spec.io/v1alpha3
kind: HTTPRouteGroup
metadata:
  name: tea-hrg
  namespace: default
spec:
  matches:
  - name: beta-session-cookie
    headers:
    - cookie: "version=beta"


This example starts your canary deployment with a 0-100 split, meaning all your beta testers experience Tea.cream2.svc, but of course you could start with whatever ratio aligns with your beta-testing strategy. Once your beta testing is complete, you can use a simple canary deployment (without the cookie routing) to test the resilience of Tea.cream2.svc.

Check out these docs for more details on traffic splits with a service mesh. The above traffic split configuration is self-referential, as the root service is also listed as a backend service. This configuration is not currently supported by the Service Mesh Interface specification (smi-spec). However, the spec is currently in alpha and subject to change.

When (and How) to Use an API Gateway Tool for Kubernetes Apps

Though most API gateway use cases for Kubernetes can (and should) be addressed by an ingress controller or a service mesh, there are some specialized situations where an API gateway tool is suitable.

Business Requirements

Using both an ingress controller and an API gateway inside Kubernetes can provide flexibility for organizations to achieve business requirements. Some scenarios include:

  • Your API gateway team isn’t familiar with Kubernetes and doesn’t use YAML. For example, if they’re comfortable with NGINX config, then it eases friction and lessens the learning curve if they deploy NGINX as an API gateway in Kubernetes.
  • Your Platform Ops team prefers to dedicate the ingress controller solution to app traffic management only.
  • You have an API gateway use case that only applies to one of the services in your cluster. Rather than using an ingress controller to apply a policy to all your north-south traffic, you can deploy an API gateway to apply the policy only where it’s needed.

Migrating APIs into Kubernetes Environments

When migrating existing APIs into Kubernetes environments, you can publish those APIs to an API gateway tool that’s deployed outside of Kubernetes. In this scenario, API traffic is typically routed through an external load balancer (for load balancing between clusters), then to a load balancer configured to serve as an API gateway, and finally to the ingress controller or Gateway API module within your Kubernetes cluster.

The Future of Gateway API for API Gateway Use Cases

This conversation would be incomplete without a brief discussion of the Kubernetes Gateway API (which is not the same as an API gateway). Generally seen as the future successor of the Ingress API, the Gateway API can be implemented for both north-south and east-west traffic. This means an implementation could perform ingress controller capabilities, service mesh capabilities or both. Ultimately, there’s potential for Gateway API implementations to act as an API gateway for all your Kubernetes traffic.

Gateway API is in beta, and there are numerous vendors, including NGINX, experimenting with implementations. It’s worth keeping a close eye on the innovation in this space and maybe even start experimenting with the beta version yourself.

Watch this short video to learn more about how an API gateway differs from the Gateway API:

https://youtu.be/GQOf4t4KGbw

Conclusion: Right Tools for the Right Use Case

For Kubernetes newcomers and even for folks with a decent amount of experience, APIs can be painfully confusing. We hope these rules of the road can provide guidance on how to build out your Kubernetes architecture effectively and efficiently.

Of course, your mileage may vary, and your use case or situation may be unique. But if you stick to Kubernetes native tools to simplify your tool stack and only consider using a separate API gateway (particularly outside of Kubernetes) for very specific situations like those outlined above, your journey should be much smoother.

The post API Gateway, Ingress Controller or Service Mesh: When to Use What and Why appeared first on The New Stack.

]]>
13 Years Later, the Bad Bugs of DNS Linger on https://thenewstack.io/13-years-later-the-bad-bugs-of-dns-linger-on/ Tue, 14 Feb 2023 17:00:09 +0000 https://thenewstack.io/?p=22699810

It’s 2023, and we are still copying code without fully debugging. Did we not learn from the Great DNS Vulnerability

The post 13 Years Later, the Bad Bugs of DNS Linger on appeared first on The New Stack.

]]>

It’s 2023, and we are still copying code without fully debugging. Did we not learn from the Great DNS Vulnerability of 2008? Fear not, internet godfather Paul Vixie has provided guidelines on how to do better.

Vixie, a Distinguished Engineer and Vice President of Security at Amazon Web Services (and contributor to the internet before it was “The Internet”), spoke about the cost of open source dependencies in a talk at Open Source Summit Europe in Dublin, which he revisited in a recent blog post. Both are highly recommended viewing and reading.

Flashback to 2008

In 2008, security expert Dan Kaminsky discovered a fundamental design flaw in DNS code that allowed for arbitrary cache poisoning that affected nearly every DNS server on the planet. The patch was released in July 2008 followed by the permanent solution, Domain Name Security Extensions (DNSSEC), in 2010. The Domain Name System is the basic name-based global addressing system for The Internet, so vulnerabilities in DNS could spell major trouble for pretty much everyone on the Internet.

Vixie and Kaminsky “set [their] hair on fire” to build the security vulnerability solution that “13 years later, is not widely enough deployed to solve this problem,” Vixie said. All of this software is open-source and inspectable but the DNS bugs are still being brought to Vixie’s attention in the present day.

“This is never going to stop if we don’t start writing down the lessons people should know before they write software,”  Vixie said.

How Did This Happen?

It’s our fault; “the call is coming from inside the house.” Before internet commercialization and the dawn of the home computer room, publishers of the Berkeley Software Distribution (BSD) of UNIX decided to support the then-new DNS protocol. “Spinning up a new release, making mag tapes, and putting them all in shipping containers was a lot of work,” so they published DNS as a patch, posted it to Usenet newsgroups, and made it available to anyone who wanted it via an FTP server and mailing list.

When Vixie began working on DNS at Berkeley, DNS was for all intents and purposes abandonware, insofar as all the original creators had since moved on. Since there was no concept of importing code and making dependencies, embedded systems vendors copied the original code and changed the API names to suit their local engineering needs… does this sound familiar?

And then Linux came along. The internet E-X-P-L-O-D-E-D. You get an AOL account. And you get an AOL account…

Distros had to build their first C library and copied some version of the old Berkeley code, whether they knew what it was or not. It was a copy of a copy that some other distro was using; they made a local version forever divorced from the upstream. DSL modems are an early example of this. Now the Internet of Things is everywhere and “all of this DNS code in all of the billions of devices are running on some fork of a fork of a fork of code that Berkeley published in 1986.”

Why does any of this matter? The original DNS bugs were written and shipped by Vixie. He then went on to fix them in the ’90s, but some still appear today. “For embedded systems today to still have that problem, any of those problems, means that whatever I did to fix it wasn’t enough. I didn’t have a way of telling people.”

Where Do We Go from Here?

“Sure would have been nice if we already had an internet when we were building one,” Vixie said. But, try as we might, we can’t go backward; we can only go forward. Vixie made it very clear: “if you can’t afford to do these things [below] then free software is too expensive for you.”

Here is some of Vixie’s advice for software producers:

  • Do the best you can with the tools you have but “try to anticipate what you’re going to have.”
  • Assume all software has bugs “not just because it always has, but because that’s the safe position to take.” Machine-readable updates are necessary because “you can’t rely on a human to monitor a mailing list.”
  • Version numbers are must-haves for your downstream. “The people who are depending on you need to know something more than what you thought worked on Tuesday.” It doesn’t matter what it is as long as it uniquely identifies the bug level of the software.
  • Cite code sources in the README files in source code comments. It will help anyone using your code and chasing bugs.
  • Automate monitoring of your upstreams, review all changes, and integrate patches. “This isn’t optional.”
  • Let your downstream know about changes automatically “otherwise these bugs are going to do what the DNS bugs are doing.”

Here is the advice for software consumers:

  • Your software’s dependencies are your dependencies. “As a consumer when you import something, remember that you’re also importing everything it depends on… So when you check your dependencies, you’d have to do it recursively you have to go all the way up.”
  • Uncontracted dependencies can make free software incredibly expensive but are an acceptable operating risk because “we need the software that everybody else is writing.” Orphaned dependencies require local maintenance and therein lies the risk because that’s a much higher cost than monitoring the developments that are coming out of other teams. “It’s either expensive because you hire enough people and build enough automation or it’s expensive because you don’t.”
  • Automate dependency upgrades (mostly) because sometimes “the license will change from one you could live with to one that you can’t or at some point someone decides they’d like to get paid” [insert adventure here].
  • Specify acceptable version numbers. If versions 5+ have the fix needed for your software, say that to make sure you don’t accidentally get an older one.
  • Monitor your supply chain and ticket every release. Have an engineer review every update to determine if it’s “set my hair on fire, work over the weekend or we’ll just get to it when we get to it” priority level.

He closed with “we are all in this together but I think we could get it organized better than we have.” And it sure is one way to say it.

There is a certain level of humility and grace one has to have after being on the tiny team that prevented a potential DNS collapse, being a leader in the field for over a generation, and still having early-career bugs (that they solved 30 years ago) brought to their attention at regular intervals because adopters aren’t inspecting the source code.

The post 13 Years Later, the Bad Bugs of DNS Linger on appeared first on The New Stack.

]]>
EU Analyst: The End of the Internet Is Near https://thenewstack.io/eu-analyst-the-end-of-the-internet-is-near/ Thu, 02 Feb 2023 11:00:07 +0000 https://thenewstack.io/?p=22699205

The internet as we know it may no longer be a thing, warns a European Union-funded researcher. If it continues

The post EU Analyst: The End of the Internet Is Near appeared first on The New Stack.

]]>

The internet as we know it may no longer be a thing, warns a European Union-funded researcher. If it continues to fray, our favorite “network of networks” will just go back to being a bunch of networks again. And it will be the fault of us all.

“The idea of an open and global internet is progressively deteriorating and the internet itself is changing,” writes Konstantinos Komaitis, author of the report, “Internet Fragmentation: Why It Matters for Europe” posted Tuesday by the EU Cyber Diplomacy Initiative.

In short, the global and open nature of the internet is being impacted by larger geo-political forces, perhaps beyond everyone’s control. “Internet fragmentation must be seen both as a driver and as a reflection of an international order that is increasingly growing fragmented,” Komaitis concluded.

The vision for the internet has always been one of end-to-end communications, where one end device on the internet can exchange packets with any other end device, regardless of what network either one of them was on. And, by nature, the internet was meant to be open, with no central governing authority, allowing everyone in the world to join, for the benefit of all, rich or poor.

In practice, these technical and ideological goals may have played out inconsistently (NAT… cough), but the internet has managed to keep on keeping on for a remarkably long time for such a minimally-managed effort.

Yet, this may not always be the case, Komaitis foretells.

He notes the internet is besieged from all sides by potential fragmentation: from commercial pressures, technical changes and government interference. Komaitis highlighted a few culprits:

  • DNS: The Domain Name System is the index that holds everything together, mapping domain names to IP numbers. The Internet Corporation for Assigned Names and Numbers (ICANN) manages this work on a global scale, but there’s nothing to stop another party from setting up an alternative root server. A few have tried: The International Telecommunications Union’s Digital Object Architecture (DOA) as well as Europe’s Network and Information Systems both set out to challenge the global DNS.
  • Stalled IPv4 to IPv6 transition: The effort to move the internet from the limited IPv4 addressing scheme to the much larger IPv6 address pool has been going on for well over two decades now, with only limited success thus far. “Even though there is a steady increase in the adoption of IPv6 addresses, there is still a long way to go,” Komaitis writes. He notes that “Just 32 economies” have IPv6 adoption rates above the global average of 30%. Without full IPv6 adoption, he argues, the internet will continue to be fragmented, with no assurance of end-to-end connectivity across those using one version or the other.
  • Internet content blocking: Governments have taken an increasing interest in curating the internet for their own citizens, using tools such as DNS filtering, IP blocking, distributed denial of service (DDoS) attacks and search result removals. The most prominent example is China, which runs “a sophisticated filtering system that can control which content users are exposed to,” Komaitis wrote.
  • Breakdown of peering agreements: The internet is the result of a set of bilateral peering agreements, which allow very small internet service providers to share the address space with global conglomerates. Increasingly, however, the large telcos are prioritizing their own traffic at the expense of smaller players. The European Union is looking at ways to restructure these agreements, though South Korea tried this and the results ended up just confusing and burdening the market, Komaitis wrote.

Other contributing factors that Komaitis discussed include walled gardens, data localization practices (e.g., GDPR) and ongoing governmental interest/interference in open standards bodies.

What does all this mean for the European Union, which funded this overview? The Union has already pledged to offer everyone online access by 2030, as well as to thwart any commercial or government attempts to throttle or prioritize internet traffic. It has also made a pledge, with the U.S. and other governments, to ensure the internet is “open, global and interoperable.”

So the EU needs to make the choice of whether or not to back its pledges.

“Moving forward, Europe must make a choice as to what sort of internet it wants: an open, global, interoperable internet or one that is fragmented and limited in choice?” Komaitis wrote.

The EU Cyber Diplomacy Initiative is “an EU-funded project focused on policy support, research, outreach and capacity building in the field of cyber diplomacy,” according to the project’s site.

The post EU Analyst: The End of the Internet Is Near appeared first on The New Stack.

]]>
Turbocharging Host Workloads with Calico eBPF and XDP https://thenewstack.io/turbocharging-host-workloads-with-calico-ebpf-and-xdp/ Fri, 27 Jan 2023 18:36:56 +0000 https://thenewstack.io/?p=22698818

In Linux, network-based applications rely on the kernel’s networking stack to establish communication with other systems. While this process is

The post Turbocharging Host Workloads with Calico eBPF and XDP appeared first on The New Stack.

]]>

In Linux, network-based applications rely on the kernel’s networking stack to establish communication with other systems. While this process is generally efficient and has been optimized over the years, in some cases it can create unnecessary overhead that can affect the overall performance of the system for network-intensive workloads such as web servers and databases.

XDP (eXpress Data Path) is an eBPF-based high-performance datapath inside the Linux kernel that allows you to bypass the kernel’s networking stack and directly handle packets at the network driver level.

XDP can achieve this by executing a custom program to handle packets as they are received by the kernel. This can greatly reduce overhead, improve overall system performance and improve network-based applications by shortcutting the normal networking path of ordinary traffic.

However, using raw XDP can be challenging due to its programming complexity and the high learning curve involved. Solutions like Calico Open Source offer an easier way to tame these technologies.

Calico Open Source is a networking and security solution that seamlessly integrates with Kubernetes and other cloud orchestration platforms. While best known for its policy engine and security capabilities, Calico offers many other features that can be used in an environment once it is installed. These include routing, IP address management and a pluggable data plane with various options such as eBPF (extended Berkeley Packet Filter), IPtables and Vector Packet Processor (VPP).

In addition to these features, Calico also makes it simple to create and load custom XDP programs to target your cluster host interfaces.

The power of XDP can be used to improve the performance of your host workloads by using the same familiar Kubernetes-supported syntax that you use to manage your cluster resources every day.

Calico’s integration with XDP works and is implemented in the same way for any cluster running Linux and Calico, whether it uses IPtables, IP Virtual Server (IPVS) or Calico’s eBPF data plane.

By the end of this article, you will know how to harness the power of XDP programs to accelerate your host workloads without needing to learn any programming languages. You will also learn more about these technologies, how Calico offers an easier way to adapt to the ever-changing cloud computing scene and how to use these cutting-edge technologies to boost your cluster performance.

XDP Workload Acceleration

One of the main advantages of XDP for high-connection workloads is its ability to operate at line rate with low overhead for the system. Because XDP is implemented directly in the kernel at the earliest possible execution point, it can process packets at very high speeds with minimal latency. This makes it well suited for high-performance networking applications that require fast and efficient packet processing.

The following image illustrates the packet flow in the Linux kernel.

Fig 1: XDP and general networking packet flow

https://upload.wikimedia.org/wikipedia/commons/3/37/Netfilter-packet-flow.svg

To better understand the advantages of XDP, let’s examine it in a common setup by running an in-memory database on a Kubernetes host and using XDP to bypass the Linux conntrack features.

Host Network Security

A host-based workload is a type of application that runs directly on the host machine. It is often used to describe workloads that are not deployed in a container orchestration platform like Kubernetes, but rather directly on a host.

The following code block illustrates a host workload network socket.
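(The listing below is illustrative rather than a capture from a real host; it assumes a Redis server bound directly to the host’s interfaces on port 6379, so process names, PIDs and counts will differ on your system.)

sudo ss -ltnp | grep 6379
LISTEN 0      511      0.0.0.0:6379      0.0.0.0:*    users:(("redis-server",pid=1342,fd=6))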

This type of workload is quite difficult to secure by using the ordinary Kubernetes network policy (KNP) resource, since host workloads do not belong to any of the namespaces that Kubernetes moderates. In fact, one of the shortcomings of KNP is its limitation in addressing these types of traffic. But don’t worry, the modular nature of Kubernetes allows us to use Container Networking Interface (CNI) plugins, such as Calico, to address such issues.

Calico Host Endpoint Policy (HEP)

Calico host endpoint policy (HEP) is a Kubernetes custom resource definition that enables the manipulation of host traffic within a cluster. A HEP in Calico represents a host participating in a cluster and the traffic generated by the workload on that host. HEPs can be associated with host network cards and allow you to apply network security policies to the traffic generated by the workload on the host.

A host endpoint policy is defined using the HostEndpoint resource in Calico, and it has the following structure:
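(A minimal sketch of such a resource; the node name, interface and IP address are placeholders, and the hep: redis-host label is the one referenced by the selector example further down.)

apiVersion: projectcalico.org/v3
kind: HostEndpoint
metadata:
  name: redis-host-eth0
  labels:
    hep: redis-host
spec:
  interfaceName: eth0
  node: worker-1
  expectedIPs:
  - 10.0.0.12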

The metadata field contains information about the host endpoint, such as its name and any labels that are associated with it. Similar to other Kubernetes resources, these labels can be used by other resources to reference the traffic to or from a HEP.

The spec field contains the configuration for the HEP, including the name of the interface that it is associated with, the node on which it is running and the expected IP addresses of the designated Kubernetes node network interface card.

Using HostEndpoint with Security Policies

Similar to other Kubernetes security resources, a HEP defaults to deny-all behavior and will impose a lockdown on your cluster in the absence of an explicit allow rule, which means you need a preemptive look at the traffic that should be allowed in your cluster before implementing such a resource. But on top of its security advantages, you can refer to HEP labels from other Calico security resources, such as global security policies, to control traffic that might otherwise be difficult to control and to create more complex and fine-grained security rules for your cluster environment.

The following selector in a security policy would allow you to filter packets that are associated with the redis-host host endpoint policy:

selector: has(hep) && hep == "redis-host"

This selector matches packets that have the hep label with a value of redis-host. You can use this selector in combination with other rules in a security policy to specify how the matched packets should be treated by your CNI. In the next section, we will use the same logic to bypass the Linux conntrack feature.

Connection Tracking in Linux

By default, networking in Linux is stateless, meaning that each incoming and outgoing traffic flow must be specified before it can be processed by the system networking stack. While this provides a strong security measure, it can also add complexity in certain situations. To address this, conntrack was developed.

Conntrack, or connection tracking, is a core feature of the Linux kernel used by technologies such as stateful firewalls. It allows the kernel to keep track of all logical network connections or flows by maintaining a list of the state of each connection, such as the source and destination IP addresses, protocol and port numbers.

This list has an adjustable soft limit, meaning that it can expand as needed to accommodate new connections. However, in some cases, such as with short-lived connections, conntrack can become a bottleneck for the system and affect performance.

For example, this is the limit on my local computer.
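
You can read the ceiling straight from the kernel; the number shown here is just a common default, not a recommendation:

cat /proc/sys/net/netfilter/nf_conntrack_max
262144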

While it is possible to dig deeper into the conntrack table details from the /proc/sys/net/netfilter path in your system, applications such as conntrack can give you a more organized view of these records.

The following code block illustrates recorded entries by my local Kubernetes host.
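
With the conntrack-tools package installed, listing the table produces entries along these lines; the addresses, ports and counts are illustrative:

conntrack -L
tcp      6 431999 ESTABLISHED src=10.0.0.12 dst=10.0.0.15 sport=51234 dport=6443 src=10.0.0.15 dst=10.0.0.12 sport=6443 dport=51234 [ASSURED] mark=0 use=1
udp      17 29 src=10.0.0.12 dst=10.0.0.2 sport=41523 dport=53 src=10.0.0.2 dst=10.0.0.12 sport=53 dport=41523 mark=0 use=1
conntrack v1.4.6 (conntrack-tools): 2 flow entries have been shown.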

In addition to securing a cluster, a host endpoint selector can be added to a Calico global security policy with a doNotTrack value to bypass Linux connection tracking for specific flows. This method can be beneficial for network-intensive workloads that receive a massive number of short-lived requests, as it prevents the Linux conntrack table from overflowing.

The following code block is a policy example that bypasses the Linux conntrack table.
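
A sketch of such a policy, reusing the redis-host label from the earlier selector; the port and order values are illustrative:

apiVersion: projectcalico.org/v3
kind: GlobalNetworkPolicy
metadata:
  name: redis-host-do-not-track
spec:
  selector: has(hep) && hep == "redis-host"
  order: 1
  doNotTrack: true
  applyOnForward: true
  ingress:
    - action: Allow
      protocol: TCP
      destination:
        ports:
          - 6379
  egress:
    - action: Allow
      protocol: TCP
      source:
        ports:
          - 6379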

It is worth noting that since doNotTrack disables Linux conntrack capabilities, any traffic that matches the previous policy becomes stateless. For return traffic to reach its originating source, we have to explicitly add a return (egress) rule to our Calico global security policy resource.

Calico’s workload acceleration is not limited to XDP. In fact, Calico’s eBPF data plane can be used to provide acceleration for other types of workloads in a cluster. If you would like to learn more about Calico’s eBPF data plane, please click here.

Conclusion

Overall, eBPF and XDP are powerful technologies that offer significant advantages for high-connection workloads, including high performance, low overhead and programmability.

In this article, we established how Calico makes it easier to take advantage of these technologies without worrying about the steep learning curve they otherwise involve.

Check out our free self-paced workshop, “Turbocharging host workloads with Calico eBPF and XDP,” to learn more.

The post Turbocharging Host Workloads with Calico eBPF and XDP appeared first on The New Stack.

]]>
Azure Went Dark https://thenewstack.io/azure-went-dark/ Wed, 25 Jan 2023 21:06:13 +0000 https://thenewstack.io/?p=22698744

And down went all Microsoft 365 services around the world. One popular argument against putting your business trust in the

The post Azure Went Dark appeared first on The New Stack.

]]>

And down went all Microsoft 365 services around the world.

One popular argument against putting your business trust in the cloud is that if your hyper-cloud provider goes down, so does your business. Well, early one U.S. East Coast morning, it happened. Microsoft Azure went down and along with it went Microsoft 365, Exchange Online, Outlook, SharePoint Online, OneDrive for Business, GitHub, Microsoft Authenticator, and Teams. In short, pretty much everything running on Azure went boom.

Azure’s status page revealed the outage hit everything in the Americas, Europe, Asia-Pacific, the Middle East, and Africa. The only area to avoid the crash was China.

First Report

Microsoft first reported the problem at 2:31 a.m. Eastern, just as Europe was getting to work. The Microsoft 365 Status Twitter account reported, “We’re investigating issues impacting multiple Microsoft 365 services.”

Of course, by that time, users were already screaming. As one Reddit user on the sysadmin subreddit wrote, "Move it to the cloud, they said, it will never go down, they said, we will save so much money they said."

The Resolution

Later, Microsoft reported, “We’ve rolled back a network change that we believe is causing impact. We’re monitoring the service as the rollback takes effect.” By 9:31 a.m., Microsoft said the disaster was over. “We’ve confirmed that the impacted services have recovered and remain stable.” But, “We’re investigating some potential impact to the Exchange Online Service.” So, Exchange admins and users? Don’t relax just yet.

What Caused It?

So, what really caused it? Microsoft isn’t saying, but my bet, as a former network administrator, is it was either a Domain Name System (DNS) or Border Gateway Protocol (BGP) misconfiguration. Given the sheer global reach of the failure across multiple Azure Regions, I’m putting my money on BGP.

The post Azure Went Dark appeared first on The New Stack.

]]>
Performance Measured: How Good Is Your WebAssembly? https://thenewstack.io/performance-measured-how-good-is-your-webassembly/ Thu, 19 Jan 2023 16:00:25 +0000 https://thenewstack.io/?p=22697175

WebAssembly adoption is exploding. Almost every week at least one startup, SaaS vendor or established software platform provider is either

The post Performance Measured: How Good Is Your WebAssembly? appeared first on The New Stack.

]]>

WebAssembly adoption is exploding. Almost every week at least one startup, SaaS vendor or established software platform provider is either beginning to offer Wasm tools or has already introduced Wasm options in its portfolio, it seems. But how can all of the different offerings compare performance-wise?

The good news is that, given Wasm's runtime simplicity, the actual runtime performance can be compared directly among the different WebAssembly offerings. This direct comparison is certainly much easier to do than when benchmarking distributed applications that run on or with Kubernetes, containers and microservices.

This means that whether a Wasm application is running in a browser, on an edge device or on a server, the computing optimization that Wasm offers in each instance is end to end. Its runtime environment is a tunnel of sorts, which is obviously good for security, and it is not affected by the environments in which it runs, since it runs directly at the machine level on the CPU.

Wasm has also been around for a while: the World Wide Web Consortium (W3C) named it a web standard in 2019, making it the fourth web standard alongside HTML, CSS and JavaScript. But while web browser applications have represented Wasm's central and historical use case, again, the point is that it is designed to run anywhere on a properly configured CPU.

In the case of a PaaS or SaaS service in which Wasm is used to optimize computing performance — whether it is running in a browser or not — the computing optimization that Wasm offers in runtime can be measured directly between the different options.

At least one application is increasingly being adopted that can be used to benchmark the different runtimes, compilers and JITs of the different versions of Wasm: libsodium. While anecdotal, this writer has contacted at least 10 firms that have used it or know of it.

Libsodium consists of a library for encryption, decryption, signatures, password hashing and other security-related applications. Its maintainers describe it in the documentation as a portable, cross-compilable, installable and packageable fork of NaCl, with an extended API to improve usability.

Since its introduction, the libsodium benchmark has been widely used to pick the best runtimes, said Frank Denis, a cryptography engineer. Libsodium includes 70 tests, covering a large number of optimizations code generators can implement, Denis noted. None of these tests perform any kind of I/O (disk or network), so they are actually measuring the real efficiency of compilers and runtimes, in a platform-independent way. "Runtimes would rank the same on a local laptop and on a server in the cloud," Denis said.

Indeed, libsodium is worthwhile for testing some Wasm applications, Fermyon Technologies CEO and co-founder Matt Butcher told The New Stack. "Any good benchmark tool has three desirable characteristics: It must be repeatable, fair (or unbiased toward a particular runtime), and reflective of production usage," Butcher said. "Libsodium is an excellent candidate for benchmarking. Not only is cryptography itself a proper use case, but the algorithms used in cryptography will suss out the true compute characteristics of a runtime."

Libsodium is also worthwhile for testing some Wasm environments because it includes benchmarking tasks with a wide range of different requirement profiles, some probing for raw CPU or memory performance, while others check for more nuanced performance profiles, Torsten Volk, an analyst for Enterprise Management Associates (EMA), told The New Stack. "The current results show the suite's ability to reveal significant differences in performance between the various runtimes, both for compiled languages and for interpreted ones," Volk said. "Comparing this to the performance of apps that run directly on the operating system, without WASM in the middle, provides us with a good idea of the potential for future optimization of these runtimes."

True Specs

In a blog post, Denis described how different Wasm runtimes were benchmarked in tests he completed. They included:

  • Iwasm, which is part of the WAMR (“WebAssembly micro runtime”) package — pre-compiled files downloaded from their repository.
  • Wasm2c, included in the Zig source code for bootstrapping the compiler.
  • Wasmer 3.0, installed using the command shown on their website. The three backends have been individually tested.
  • Wasmtime 4.0, compiled from source.
  • Node 18.7.0 installed via the Ubuntu package.
  • Bun 0.3.0, installed via the command shown on their website.
  • Wazero from git rev 796fca4689be6, compiled from source.

Which one came out on top in the runtime tests? Iwasm, which is part of WebAssembly Micro Runtime (WAMR), according to Denis’ results. The iwasm VM core is used to run WASM applications. It supports interpreter mode, ahead-of-time compilation (AOT) mode and just-in-time compilation (JIT) modes, LLVM JIT and Fast JIT, according to the project’s documentation.

This does not mean that iwasm wins accolades for simplicity of use. "Compared to other options, [iwasm] is intimidating," Denis wrote. "It feels like a kitchen sink, including disparate components." These include IDE integration, an application framework, remote management and an SDK "that makes it appear as a complicated solution to simple problems. The documentation is also a little bit messy and overwhelming," Denis writes.

Runtime Isn’t Everything

Other benchmarks exist to gauge the differences in performance among the different Wasm alternatives; Denis pointed to several of these test alternatives as well.

However, benchmarking runtime performance is not an essential metric for all WebAssembly applications. Other test alternatives exist to test different Wasm runtimes that focus on very specific tasks, such as calculating the Fibonacci sequence, sorting data arrays or summing up integers, Volk noted. There are other, more comprehensive benchmarks consisting of the analysis of entire use cases, such as video processing, PDF editing or even deep learning-based object recognition, Volk said.

“Wasm comes with the potential of delivering near-instant startup and scalability and can therefore be used for the cost-effective provisioning and scaling of network bandwidth and functional capabilities,” Volk said. “Evaluating this rapid startup capability based on specific use case requirements can show the direct impact of a Wasm runtime on the end-user experience.”

Some Wasm applications are used in networking to improve latency. Runtime performance is important, of course, but it is the latency performance that counts in this case, Sehyo Chang, chief technology officer at InfinyOn, said. This is because, Chang said, latency plays a crucial role in determining the overall user experience in any application. “A slow response time can greatly impact user engagement and lead to dissatisfaction, potentially resulting in lost sales opportunities,” Chang said.

During a recent KubeCon + CloudNativeCon conference, Chang gave a talk about using Wasm to replace Kafka for lower-latency data streaming. Streaming technology based on Java, like Kafka, can experience high latency due to garbage collection and JVM overhead, Chang said. However, using WebAssembly (WASM) technology allows for stream processing without these penalties, resulting in a significant reduction in latency while also providing more flexibility and security, Chang said.

The post Performance Measured: How Good Is Your WebAssembly? appeared first on The New Stack.

]]>
How to Overcome Challenges in an API-Centric Architecture https://thenewstack.io/how-to-overcome-challenges-in-an-api-centric-architecture/ Mon, 09 Jan 2023 17:00:50 +0000 https://thenewstack.io/?p=22697222

This is the second in a two-part series. For an overview of a typical architecture, how it can be deployed

The post How to Overcome Challenges in an API-Centric Architecture appeared first on The New Stack.

]]>

This is the second in a two-part series. For an overview of a typical architecture, how it can be deployed and the right tools to use, please refer to Part 1.

Most APIs impose usage limits, such as a maximum number of requests per month, and rate limits, such as a maximum of 50 requests per minute. A third-party API can be used by many parts of the system. Handling subscription limits requires the system to track all API calls and raise alerts if the limit will be reached soon.

Often, increasing the limit requires human involvement, and alerts need to be raised well in advance. The system deployed must be able to track API usage data persistently to preserve data across service restarts or failures. Also, if the same API is used by multiple applications, collecting those counts and making decisions needs careful design.

Rate limits are more complicated. If they are simply handed down to developers, developers will invariably add sleep statements, which solves the problem in the short term; in the long run, however, this leads to complicated issues when the timing changes. A better approach is to use a concurrent data structure that limits rates, as sketched below. Even then, if the same API is used by multiple applications, controlling rates is more complicated.
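
As a rough illustration of that idea, here is a minimal token-bucket limiter in Python; the 50-requests-per-minute figure is just the example limit from above, and the class is a sketch rather than any particular library's API:

import threading, time

class TokenBucket:
    """Thread-safe token bucket: roughly `rate` calls per second, with small bursts."""
    def __init__(self, rate, capacity):
        self.rate = rate              # tokens added per second
        self.capacity = capacity      # maximum burst size
        self.tokens = capacity
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self):
        """Block until a token is available, then consume it."""
        while True:
            with self.lock:
                now = time.monotonic()
                self.tokens = min(self.capacity,
                                  self.tokens + (now - self.updated) * self.rate)
                self.updated = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
                wait = (1 - self.tokens) / self.rate
            time.sleep(wait)

# One limiter shared by every part of the app that calls the same third-party API,
# e.g. a hypothetical cap of 50 requests per minute:
limiter = TokenBucket(rate=50 / 60, capacity=5)
limiter.acquire()  # only call the external API once this returns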

One option is to assign each application a portion of the rate limit, but the downside is that some bandwidth will be wasted: while some applications are waiting for capacity, others might be idling. The most practical solution is to send all calls through an outgoing proxy that can handle all limits.

Apps that use external APIs will almost always run into this challenge. Even internal APIs will have the same challenge if they are used by many applications. If an API is only used by one application, there is little point in making that an API. It may be a good idea to try to provide a general solution that handles subscription and rate limits.

Overcoming High Latencies and Tail Latencies

Given a series of service calls, tail latencies are the few service calls that take the most time to finish. If tail latencies are high, some of the requests will take too long or time out. If API calls happen over the internet, tail latencies keep getting worse. When we build apps combining multiple services, each service adds latency. When combining several services, the risk of timeouts increases significantly.

Tail latency is a topic that has been widely discussed, which we will not repeat. However, it is a good idea to explore and learn this area if you plan to run APIs under high-load conditions. See [1], [2], [3], [4] and [5] for more information.

But, why is this a problem? If the APIs we expose do not provide service-level agreement (SLA) guarantees (such as in the 99th percentile in less than 700 milliseconds), it would be impossible for downstream apps that use our APIs to provide any guarantees. Unless everyone can stick to reasonable guarantees, the whole API economy will come crashing down. Newer API specifications, such as the Australian Open Banking specification, define latency limits as part of the specification.

There are several potential solutions. If the use case allows it, the best option is to make tasks asynchronous. If you are calling multiple services, it inevitably takes too long, and often it is better to set the right expectations by promising to provide the results when ready rather than forcing the end user to wait for the request.

When service calls do not have side effects (such as search), there is a second option: latency hedging, where we start a second call when the wait time exceeds the 80th percentile and respond when one of them has returned. This can help control the long tail.
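
A minimal sketch of that hedging pattern in Python, assuming the call is idempotent and that an 80th-percentile latency estimate is maintained elsewhere:

from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

# Small shared pool; the extra, hedged calls run here.
_pool = ThreadPoolExecutor(max_workers=8)

def hedged_call(fetch, p80_seconds):
    """Start `fetch`; if it has not finished by the 80th-percentile latency,
    start a second identical call and return whichever finishes first."""
    first = _pool.submit(fetch)
    done, _ = wait([first], timeout=p80_seconds)
    if done:
        return first.result()
    second = _pool.submit(fetch)
    done, _ = wait([first, second], return_when=FIRST_COMPLETED)
    return next(iter(done)).result()

The losing call keeps running in the background, which is the cost of the technique: you trade a little extra load for a shorter tail.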

The third option is to try to complete as much work as possible in parallel by not waiting for a response when we are doing a service call and parallelly starting as many service calls as possible. This is not always possible because some service calls might depend on the results of earlier service calls. However, coding to call multiple services in parallel and collecting the results and combining them is much more complex than doing them one after the other.

When a timely response is needed, you are at the mercy of your dependent APIs. Unless caching is possible, an application can’t work faster than any of its dependent services. When the load increases, if the dependent endpoint can’t scale while keeping the response times within the SLA, we will experience higher latencies. If the dependent API can be kept within the SLA, we can get more capacity by paying more for a higher level of service or by buying multiple subscriptions. When that is possible, keeping within the latency becomes a capacity planning problem, where we have to keep enough capacity to manage the risk of potential latency problems.

Another option is to have multiple API options for the same function. For example, if you want to send an SMS or email, there are multiple options. However, it is not the same for many other services. It is possible that as the API economy matures, there will be multiple competing options for many APIs. When multiple options are available, the application can send more traffic to the API that responds faster, giving it more business.

If our API has one client, then things are simple. We can let the client use the API as much as our system allows. However, if we are supporting multiple clients, we need to reduce the possibility of one client slowing down the others. This is the same reason other APIs have rate limits. We should also define rate limits in our API's SLA. When a client sends too many requests too fast, we should reject its requests using a status code such as HTTP 503. Doing this communicates to the client that it must slow down. This process is called backpressure, where we signal to upstream clients that the service is overloaded, and that signal is eventually propagated to the end user.
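
As a bare-bones illustration using Python's standard library only (the window and per-client threshold are made-up figures, not a recommendation):

import threading, time
from http.server import BaseHTTPRequestHandler, HTTPServer

WINDOW_SECONDS = 1.0     # length of the rate window
MAX_REQUESTS = 100       # per client per window; an illustrative SLA figure
_counts = {}
_lock = threading.Lock()

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        client = self.client_address[0]
        now = time.monotonic()
        with _lock:
            start, count = _counts.get(client, (now, 0))
            if now - start > WINDOW_SECONDS:
                start, count = now, 0
            count += 1
            _counts[client] = (start, count)
        if count > MAX_REQUESTS:
            self.send_response(503)              # back off: this client is over its SLA
            self.send_header("Retry-After", "1")
            self.end_headers()
            return
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), Handler).serve_forever()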


If we are overloaded without any single user sending requests too fast, we need to scale up. If we can't scale up, we still need to reject some requests. It is important to note that rejecting requests in this case makes our system unavailable, while rejecting requests in the earlier case, where one client is going over its SLA, does not count as unavailable time.

Cold start times (the time it takes for a container to boot up and begin serving requests) are another latency source. A simple solution is to keep a replica running at all times; this is acceptable for high-traffic APIs. However, if you have many low-traffic APIs, this could be expensive. In such cases, you can predict the traffic and warm up the container ahead of time (using heuristics, AI or both). Another option is to optimize the startup time of the servers to allow for fast bootup.

Latency, scale and high availability are closely linked. Even a well-tuned system would need to scale to keep the system running within acceptable latency. If our APIs need to reject valid requests due to load, the API will be unavailable from the user’s point of view.

Managing Transactions across Multiple APIs

If you can run all the code in a single runtime (such as a JVM), we can commit it as one transaction. For example, pre-microservices-era monolithic applications could handle most transactions directly with the database. However, as we break the logic across multiple services (and hence multiple runtimes), we cannot carry a single database transaction across multiple service invocations without doing additional work.

One solution for this has been programming language-specific transaction implementations provided by an application server (such as Java transactions). Another is using Web Service atomic transactions if your platform supports it. Yet another has been to use a workflow system (such as Ode or Camunda) that has support for transactions. You can also use queues and combine database transactions and queue system transactions into a single transaction through a transaction manager like Atomikos.

This topic has been discussed in detail in the context of microservices, and we will not repeat those discussions here. Please refer to [6], [7] and [8] for more details.

Finally, with API-based architectures, troubleshooting is likely more involved. It is important to have enough tracing and logs to help you find out whether an error is happening on our side of the system or the side of third-party APIs. Also, we need clear data we can share in case help is needed from a third-party API to isolate and fix the problem.

I would like to thank Frank Leymann, Eric Newcomer and others for their thoughtful feedback to significantly shape these posts.

The post How to Overcome Challenges in an API-Centric Architecture appeared first on The New Stack.

]]>
How to Use Time-Stamped Data to Reduce Network Downtime  https://thenewstack.io/how-to-use-time-stamped-data-to-reduce-network-downtime/ Mon, 09 Jan 2023 14:41:54 +0000 https://thenewstack.io/?p=22697210

Increased regulations and emerging technologies forced telecommunications companies to evolve quickly in recent years. These organizations’ engineers and site reliability

The post How to Use Time-Stamped Data to Reduce Network Downtime  appeared first on The New Stack.

]]>

Increased regulations and emerging technologies forced telecommunications companies to evolve quickly in recent years. These organizations' engineers and site reliability engineering (SRE) teams must use technology to improve performance, reliability and service uptime. Learn how WideOpenWest uses a time series platform to monitor its entire service delivery network.

Trends in the Telecommunications Industry 

Telecommunication companies are facing challenges that vary depending on where each company is in its life cycle. Across the industry, businesses must modernize their infrastructure while also maintaining legacy systems. At the same time, new regulations at both the local and federal levels increase competition within the industry, and new businesses challenge the status quo set by current industry leaders.

In recent years, the surge in people working from home has required more reliable internet connections to handle increased network bandwidth needs. The growing popularity of smartphones and other devices means there are more devices requiring network connectivity, all without a reduction in network speeds. Latency issues or poor uptime lead to unhappy customers, who then become flight risks. Add to this situation more frequent security breaches, which require all businesses to monitor their networks to detect potential breaches faster.

Challenges to Modernizing Networks

Founded in 1996 in Denver, Colorado, WideOpenWest (WOW) provides internet, video and voice services in various markets across the United States. Over the years, WOW acquired various telecommunication organizations, and as its network expanded, it needed a better network monitoring tool to address a growing list of challenges. For instance, WOW engineers wanted to be able to analyze an individual customer's cable modem, determine the health of a node and understand the overall state of the network. However, several roadblocks prevented the company from doing so. WideOpenWest already used multiple monitoring platforms internally, and the hardware needed to monitor individual nodes was too expensive to purchase. It already had a basic process in place to collect telemetry data from specific modems, but there was no single source of truth to tie everything together.

Using Time Series Data to Reduce Network Latency 

A few years ago, WideOpenWest decided to replace its legacy time series database, and after considering other solutions, it chose InfluxDB, the purpose-built time series database. It now has a four-node cluster of InfluxDB Enterprise in production and a two-node cluster running on OpenStack for testing. The team uses Ansible to automate cluster setup and installation.

The primary motivations for using InfluxDB are to improve overall observability of the entire network and to implement better alerting. The WOW engineers use Telegraf for data collection whenever possible because it integrates easily with all the other systems. Some legacy hardware requires them to use Filebeats, custom scripts and vendor APIs.

They make extensive use of Simple Network Management Protocol (SNMP) polling and traps in the data collection process because that remains an industry standard, despite its age. Specifically, they use SNMP to collect metrics from cable modems and Telegraf to collect time-stamped data from their virtual machines and containers. Using InfluxDB provided the team with the necessary flexibility to work around restrictions from vendor-managed systems, and they now collect data from all desired sources.

Next they stream the data to Kafka to better control data input and output. Kafka also allows them to easily consume or move data into different regions or systems, if necessary. From the Kafka cluster, they use Telegraf to send data to their InfluxDB Enterprise cluster.

WOW’s team aggregates various metrics from the fiber-to-the-node network, such as:

  • Telemetry metrics, like usage and uptime, from over 650,000 cable modems on a five-minute polling cycle.
  • Status of all television channels upstream and downstream, including audio and visual signal strength and outages.
  • Average signal, port and power levels.
  • Signal-to-noise ratio (SNR) — used to ensure the highest level of wireless functionality.
  • Modulation error ratio (MER) — another measurement used to understand signal quality that factors in the amount of interference occurring on the transmission channel.

The WOW team uses all this data to gain insights from real-time analytics to create visualizations and to trigger alerts and troubleshoot processes. Once the data is in InfluxDB, they use Grafana for all their visualizations. They also leverage InfluxDB’s alerting frameworks to send alerts via ServiceNow, Slack and email. Adopting InfluxDB allowed the WOW team to implement an Infrastructure-as-Code (IaC) system, so instead of spending time manually managing their infrastructure, they can write config files to simplify processes.

Diagram 1: WideOpenWest’s InfluxDB implementation

WideOpenWest’s next big project is to implement a full CI/CD pipeline with automated code promotions. With this, they hope to improve automated testing. WOW also wants to streamline all monitoring across the organization and increase the level of infrastructure monitoring.

The post How to Use Time-Stamped Data to Reduce Network Downtime  appeared first on The New Stack.

]]>
The Right Stuff for Really Remote Edge Computing https://thenewstack.io/the-right-stuff-for-really-remote-edge-computing/ Fri, 06 Jan 2023 15:25:57 +0000 https://thenewstack.io/?p=22697059

Suppose you operate popup clinics in rural villages and remote locations where there is no internet. You need to capture

The post The Right Stuff for Really Remote Edge Computing appeared first on The New Stack.

]]>

Suppose you operate popup clinics in rural villages and remote locations where there is no internet. You need to capture and share data across the clinic to provide vital healthcare, but if the apps you use require an internet connection to work, they can’t operate in these areas.

Or perhaps you’re an oil and gas operator that needs to analyze critical warning data from a pressure sensor on a platform in the North Sea. If the data needs to be processed in cloud data centers, it has to travel incredible distances — at great expense — over unreliable networks. This incurs high degrees of latency, or network slowness, so by the time a result is sent back to the platform, it could be too late to take any action.

These kinds of use cases represent a growing class of apps that require 100% uptime and real-time speed, guaranteed — regardless of where they are operating in the world.

A fundamental challenge in meeting these requirements remains the network — there are still huge swaths of the globe with little or no internet — meaning apps that depend on connectivity cannot operate in those areas.

Emerging advances in network technology are closing those gaps, but no matter the coverage, reliability or speed of a network, it will inevitably suffer slowness and outages that affect the applications that rely on it, resulting in a poor user experience and business downtime.

The Responsible Development Choice

How do you guarantee availability and ultra-low latency for apps, especially when operating in internet dead zones? This is made possible by understanding the challenges in network connectivity and working around them.

The responsible development choice is to architect and build applications that:

  • Can still operate when network connectivity is interrupted or unavailable.
  • Can make the most efficient use of network connectivity when it is available, because it can be fleeting and may not always be fast.

To do this, you must bring data processing and compute infrastructure to the near side of the network — that is, to the literal edge, such as in the popup clinic van or on the oil platform, reducing dependencies on distant cloud data centers.

Taking It to the Edge

Cloud Architecture

A cloud computing architecture assumes that data storage and processing are hosted in the cloud: application services and the database run in the cloud and are accessed from edge devices via REST calls.

The cloud architecture depends on the internet for apps to operate properly. If there is any network slowness or interruption, the apps will slow or stop.

Edge Architecture

Edge computing architectures bring data processing to the edge, close to applications, which makes them faster because data doesn’t have to travel all the way to the cloud and back. And it makes them more reliable because local data processing means they can operate even without the internet.

It’s not about getting rid of the cloud; you still need that eventual aggregation point. It’s about extending the cloud to the near side of the network. Edge architectures use the network for synchronization, where data is synced across the application ecosystem when connectivity is available.

And it’s important to note that by “sync ” we mean something more than just using the network to replicate data. It’s also about using the precious and fleeting bandwidth as efficiently as possible when it’s available.

Sync technology provides cross-record compression, delta compression, batching, filtering, restartability and more — and because of these efficiencies it pushes less data through the wire, which is critical when on slow, unreliable or shared bandwidth networks.

Simply put, an edge architecture allows you to:

  • Capture, store and process data where it happens, providing availability and speed.
  • Sync data securely and efficiently throughout the app ecosystem as connectivity allows, providing consistency.

Now let’s explore how to adopt an edge architecture.

All You Have to Do Is ASC

Over the past couple of years, we’ve seen a growth of next-gen technologies designed to make applications more available in more places and for more users than ever before.

These advances are lowering the bar and making it easier for organizations to adopt edge architectures to guarantee speed, uptime and efficient bandwidth use for applications, especially those that operate in remote locations and internet dead zones.

To build an edge architecture, you need four fundamental system components:

  1. A cloud compute environment.
  2. An edge compute environment.
  3. A network connecting the cloud and the edge.
  4. A database that synchronizes from the cloud to the edge.

Here we combine three state-of-the-art technologies to create an edge architecture that can operate at high speed, all the time, anywhere on the planet.

We call it the ASC stack:

  • AWS Snowball
  • SpaceX Starlink
  • Couchbase Capella

What Is AWS Snowball?

AWS Snowball is a service that provides secure, portable, rugged devices (called AWS Snowball Edge devices) that run AWS infrastructure for powering applications at the edge.

AWS Snowball Edge Device (Source: Amazon)

The devices are about the size of a suitcase and deliver local computing, data processing and data storage for disconnected environments such as ships, mines, oil platforms, field clinics and remote manufacturing facilities. Wherever AWS infrastructure is required but unfeasible due to lack of reliable internet, Snowball provides a portable solution.

Described in simpler terms, Snowball is an “AWS-data-center-in-a-box” that arrives at your door preconfigured with AWS services and ready to go. It supports AWS S3, EC2, Lambda, EBS and more. You plug it in, then access and manage the environment via the AWS Control Plane over local networks.

By providing a portable, familiar, standards-based infrastructure, AWS Snowball makes it easy for anyone to set up and run edge data centers without worrying about internet connectivity.

What Is SpaceX Starlink?

Starlink is a next-gen satellite internet service from SpaceX. It is made up of “constellations” of thousands of small satellites in low earth orbit — about 340 miles in space. This is opposed to traditional large-scale geostationary satellites that orbit in a fixed location about 22,000 miles up.

Because of the shorter physical distance between the customer’s dish and the satellite, Starlink can deliver 20- to 50-millisecond latency on average, which is much faster than traditional satellite internet (which, due to the greater distance, can suffer latencies up to 600 milliseconds or more).

The lower orbit and smart network technology allows Starlink to offer performance comparable to terrestrial networks. Their “Business” service offers download speeds of up to 350 Mbps and latency of 20-40 ms.

While Starlink provides vital internet connectivity to areas with few or no other options, it is not foolproof. Connections can suffer slowdowns during peak hours when most users in a given cell are likely to be sharing bandwidth, or if the dish experiences interference from nearby household appliances, fluorescent lights or other Wi-Fi networks. And obstructions such as cloud cover, tree branches or thick walls can interrupt the connection.

As such, it’s important to develop apps that can withstand intermittent slowness and interruptions and remain fully available. To do so, you must maximize the efficient use of this precious shared network resource by moving the smallest amount of data possible, in its most compact form.

What Is Couchbase?

Couchbase is a NoSQL cloud database platform with in-memory speed, SQL familiarity and JSON flexibility. It natively supports the edge architecture by providing:

  • Couchbase Capella: A fully managed cloud database-as-a-service (DBaaS).
  • Capella App Services: Fully managed services for file storage, bidirectional sync, authentication and access control for mobile and edge apps.
  • Couchbase Lite: A lightweight embeddable version of the Couchbase database.

Capella App Services synchronizes data between the backend cloud database and edge databases as connectivity allows, while during network disruptions apps continue to operate thanks to local data processing.

With Couchbase, you can create multitier edge architectures to support any speed, availability or low bandwidth requirement.

Couchbase provides built-in sync to support complex multitier edge architectures (Source: Couchbase)

Testing the Stack

Couchbase Engineering wanted to determine a baseline for how the ASC stack works better together, with each technology working to enhance and augment the others' functionality.

To do so, we set up the stack in a remote location in a classic edge architecture:

  • AWS Snowball Edge provides computing infrastructure at the edge.
  • Couchbase is deployed to the Snowball Edge device for local data storage and processing.
  • Couchbase Capella serves as the hosted backend DBaaS in the cloud.
  • Starlink provides the network from the Snowball Edge device to Couchbase Capella.
  • Couchbase Capella App Services provides secure synchronization between the edge database and the cloud database.

Couchbase running on AWS Snowball using Starlink to Sync to Capella (Source: Couchbase)

With this basic edge architecture in place, we set out to measure its effectiveness in reducing latency and bandwidth consumption as compared to a cloud architecture where the app reads and writes over Starlink via REST.

We ran four test scenarios:

  • Test 1 was to write 1,000 new docs to the Snowball Edge over a wired local area network (LAN) and measure the amount of data transferred and latency per operation.
  • Test 2 was to sync the 1,000 new docs from Snowball to Capella over Starlink and measure the amount of data transferred and complete time to transfer.
  • Test 3 was to write 1,000 new docs to Capella over Starlink and measure the amount of data transferred and latency per operation.
  • Test 4 was to sync the 1,000 new docs from Capella to Snowball over Starlink and measure the amount of data transferred and complete time to transfer.

The tests used a real-world product catalog dataset (1,000 products), at 650 bytes per record on average.

The Results

Latency Comparison

Apps accessing the Couchbase database running on the local Snowball device showed significantly reduced latency as compared to accessing the cloud database.

For reads and writes, results showed that the edge architecture reduced latency by 98% as compared to the cloud architecture:

Bandwidth Comparison

When comparing the edge architecture and the cloud architecture for bandwidth usage, results showed that the total data volume sent over Starlink decreased substantially on the edge architecture.

Because of efficiencies enabled by synchronization, like cross-record compression, delta compression, batching, filtering, restartability and more, the edge architecture makes the best use of shared bandwidth, which is critical for peak periods, when connecting under heavy cloud cover, or in remote areas like forests or jungles where obstructions may impede speed and throughput.

Syncing updates over the edge architecture reduces the amount of data transferred over Starlink by 42% as opposed to using REST calls to a cloud architecture.

These test results establish a conservative baseline for latency and bandwidth improvements that can be seen when using the ASC stack for a basic edge architecture. In a large production environment, improvements are likely to be more substantial.

The Edge Is Closer Than You Think

Couchbase has a long history of helping customers meet critical requirements for real-time speed and 100% uptime for their applications. And with the ASC stack, Couchbase joins forces with AWS Snowball and SpaceX Starlink to help organizations adopt edge computing faster, easier and in more places than ever before.

And the best part is, the stack is so portable, you can literally take it with you wherever you go!

Actual AWS Snowball Edge device and Starlink dish used in Couchbase testing

See for yourself how easy it is to get started. Try Couchbase Capella today for free.

The post The Right Stuff for Really Remote Edge Computing appeared first on The New Stack.

]]>
The Next Wave of Network Orchestration: MDSO https://thenewstack.io/the-next-wave-of-network-orchestration-mdso/ Thu, 27 Oct 2022 17:00:39 +0000 https://thenewstack.io/?p=22690183

Demand for network automation and orchestration continues to rise as organizations reap the business and technical benefits it brings to

The post The Next Wave of Network Orchestration: MDSO appeared first on The New Stack.

]]>

Demand for network automation and orchestration continues to rise as organizations reap the business and technical benefits it brings to their operations, including significant improvements in productivity, cost reduction, and efficiency.

As a result, many organizations are now looking to the next wave of network orchestration: orchestration across technology domains, more commonly known as Multi-Domain Service Orchestration (MDSO).

Early adopters have learned that effectively leveraging automation and orchestration at the domain level doesn’t necessarily translate to the MDSO layer due to the different capabilities required to effectively coordinate and communicate across different technologies. While the potential benefits of MDSO are high, organizations must tackle unique challenges in multidomain deployments.

When orchestrating across domains versus within specific domains, the most obvious difference is the need to design around the direction your network data will travel.

Within a single domain, the activities are primarily focused north to south and vice versa. Instructions are sent to the domain controller, which executes the changes to the network functions. This makes single-domain orchestration relatively straightforward.

However, when you start orchestrating across domains, things get a little more complex. Now you need to account for both north/south activities and also for a large amount of east/west transactions, including interactions with inventory systems, service assurance systems, order systems, ticketing systems, etc. This places new demands on the orchestration environment for flexibility and the capability to interact with other systems and data formats.

A Change in Thinking

MDSO also requires a change in thinking about your availability models. Instead of focusing solely on server availability — which is the primary concern when you’re orchestrating within a domain — network teams also need to focus on service availability.

In other words, what system, or group of systems, within the multiple domains will ultimately affect service availability and the speed of orchestration? The team should perform a systems analysis and develop mitigation strategies during the planning stage, designing for the availability of the server but also being realistic about what the availability of the orchestration service will be.

When designing an orchestration solution across multiple domains, network teams also need to be aware that the more systems with which they integrate, the more dependencies they’ll need to consider, including dependencies that can be outside their or their vendors’ control — such as those managed by a third party. This will have a material effect on service levels and service availability and should be a factor to include in negotiations with those third parties.

The ability to perform MDSO is also dependent on good data and multiple sources of truth. A design built around distributed sources of truth is critical due to the existence of multiple controllers, legacy systems, cloud environments, etc.

It’s important to bear in mind: you don’t need to design for a perfect solution that provides 100 percent automation coverage from the start. Trying to do that will cost more in lost time than automation could ever save. Instead, focus on the end state and the overall goals for your orchestration solution. Start with what you know, put something into production, and then measure the results so you can learn and pivot as you go. You can always optimize later.

And think modular. A modular MDSO solution design provides greater flexibility and the opportunity to develop new kinds of combinatory services, delivering a variety of benefits and functionality. 

Also, remember, MDSO doesn’t equal manual orchestration converted into code. Systems operate in different ways than people do. Revisit your existing processes and reimagine them based on computer-centric thinking when planning your MDSO solution.

MDSO in Action

Working with the right partner can make the transition to MDSO much easier and more effective.

Itential’s approach to automation and orchestration is helping organizations successfully implement MDSO. Through the Itential Automation Platform (IAP), we’ve executed more than 5 billion automation and orchestration flows across a diverse set of networking domains for enterprises and service providers across multiple verticals, including telecommunications, financial services, managed services, the public sector, and more.

The IAP is a network automation and orchestration platform offered as a service or as an on-premise solution. Designed from inception to be vendor- and technology-agnostic, the IAP supports a wide range of network and service orchestration use cases across a wide range of technologies, including cloud/containerized network functions, physical network functions, network services, applications, IT systems and OSS.

When Lumen was looking for a way to provision services faster across an increasingly complex network, and to unify their internal processes, customer purchase workflows, and disparate networking products, they turned to Itential.

Using the IAP, Lumen is able to orchestrate functions across its multiple network domains, such as SD-WAN, optical, IP, data center, and edge. Lumen’s MDSO solution integrates with the control systems for each of its separate networks and with more than 50 unique systems and network technologies, providing end-to-end visibility and management, and streamlining and accelerating how they deliver services to their customers. 

As a result, the company was able to automate more than 200 million processes, reduce customer activation times by 70%, reduce the time to route an order from 40 days to one, and reduce the time from ordering to provisioning a VPN service from more than two weeks to two days. Itential’s technology has made it possible for Lumen’s customers to leverage their application and data delivery platform in a way that best suits their individual needs.

Taking the Next Step

When addressing large-scale MDSO challenges, network teams should keep a few things in mind:

  • Think end state while building incrementally
  • Remember that MDSO doesn’t equal manual orchestration converted into code
  • Build a modular solution around multiple sources of truth
  • Don’t try to build everything upfront or have 100 percent automation coverage all at once
  • Focus on what brings the most business value to your organization and learn from the implementation process

Orchestrating across multiple domains can be challenging, but the benefits to an organization’s operations and business results are worth the effort.

The post The Next Wave of Network Orchestration: MDSO appeared first on The New Stack.

]]>
Sidecars are Changing the Kubernetes Load-Testing Landscape https://thenewstack.io/sidecars-are-changing-the-kubernetes-load-testing-landscape/ Mon, 24 Oct 2022 17:00:06 +0000 https://thenewstack.io/?p=22689822

As your infrastructure is scaling and you start to get more traffic, it’s important to make sure everything works as

The post Sidecars are Changing the Kubernetes Load-Testing Landscape appeared first on The New Stack.

]]>

As your infrastructure is scaling and you start to get more traffic, it’s important to make sure everything works as expected. This is most commonly done through testing, with load testing being the optimal way of verifying the resilience of your services.

Traditionally, load testing has been accomplished via standalone clients, like GoReplay and JMeter. However, as the world of infrastructure has gotten more modern, and organizations are using tools like Kubernetes, it’s important to have a modern toolset as well.

With traditional load testing, you’ll commonly run into one of three major issues:

  • Scripting load tests takes a lot of time
  • Load tests typically run in large, complex, end-to-end environments that are difficult to provision and expensive to run at production scale
  • Data and realistic use cases are impossible to mirror one-to-one, unless you have production data

A more modern approach is to integrate your load-testing tools directly into your infrastructure. If you’re using Kubernetes, that can be accomplished via something like an Operator. This will give you the benefit of having less infrastructure to manage, and being able to use infrastructure-specific features like autoscaling Pods.

However, the biggest advantage you get from integrating your tool into Kubernetes directly is that you can use sidecars.

You Need to Focus on Traffic Capture

There are two main components of any proper load-testing setup: traffic capture and traffic replay. This is how most load-testing tools have worked for decades, and still holds true. However, there are many ways to capture traffic.

The next section will dive deeper into how sidecars specifically help with this. But first, it’s important to understand why there’s even an issue with traditional load testing.

First, how do you record the traffic? The most common approach is to record your own traffic. You open up a browser that’s configured to capture traffic, and you start emulating how your users would be using your service.

While this can certainly be helpful in many areas, it does mean you are relying on your own perception of how your application is being used. For a simple application, this may not be an issue, but if your product is a SaaS platform, there are likely many ways your application can be used.

Second, you’re missing out on important metadata. Again, perhaps not an issue for a simple application, but something that gets increasingly important as your application scales.

If you’re the one creating the traffic, you will only be getting the headers that your specific environment produces, meaning your load tests will only be testing one specific scenario. A better approach is to capture production traffic and use that in your replays, as you will be getting a range of different requests.

Third, you need to think about your dependencies. If you are load testing your entire infrastructure, it likely won't be an issue. But if you want to do a load test on a single service, e.g., when testing is part of your CI/CD pipeline, you want to make sure that any outgoing requests go to a mock service.

With traditional load-testing tools, you’re not going to get this. Either you’ll have to set up a new service to act as a mock, or you’ll have to change the code of your application.

Fourth, you are adding more infrastructure to manage. In a modern DevOps world, you don’t want load tests to be run from a developer’s laptop. It should be executed from a service running inside your infrastructure. So, with most tools you will have to add additional services that you need to manage.

Using Sidecars to Capture Traffic

Now that you know why traditional load testing doesn’t fit into a modern infrastructure, let’s touch on the previous points, exploring how sidecars are going to solve the problems.

In Kubernetes, it’s very easy to add a sidecar and make it act as a proxy. Already you’ve solved the issue of how to capture traffic. This sidecar will be capturing all inbound requests, and sending it to another system that can store the recorded requests.

With this approach it’s easy to implement shadowing, which is the practice of capturing production traffic and replaying it in your development environment.

Because you’re recording production data directly, you will also know for sure that you’ll be getting all the metadata from every request. You may be seeing issues with this already though. What about authentication, tokens, or sensitive information?

Well, sidecars won’t just be capturing traffic. It will take care of replaying it as well. And, because the sidecar is an application in itself, it’s easy to transform any metadata, like timestamps, before it sends it to your application.

As for sensitive data, the sidecar can easily filter that out from the metadata before it gets sent to storage.

At this point, you’ve already solved the issue of managing extra infrastructure, as sidecars can be added via a few simple lines to a manifest file. You’ve also solved the issue of missing metadata, as the proxy will capture everything, unless specifically configured to do otherwise.

Then there’s still the issue of mocking dependencies. The sidecar is still acting as a proxy, both for ingoing and outgoing requests. Because of this, you can tell the sidecar that you’re executing a load test, and it will recognize the outgoing requests. Instead of forwarding the request, it will simply respond with the data it got when the request was initially recorded.

If you want to see a specific example of how sidecars can be used to perform load tests, take a look at this example from Speedscale.

Sidecars Aren’t the Only Way to Capture Traffic

At this point, it should be clear how sidecars can be used to change the way you approach load testing. However, it’s also important to mention that sidecars are just one of many ways to improve traffic capture.

With how the industry is changing, it’s very likely that traffic capture will become commoditized, and the best tool will be the one that does the best job of replaying traffic.

Postman is used by many developers to test their APIs, and many have created extensive collections with various request definitions. While this does go somewhat against the point of not creating traffic manually, there is still value to be gained from being able to test specific use cases.

Service meshes are a popular way to manage communication between services, especially microservices. Most service meshes will enable you to have some kind of observability. Through this, you’ll gain a collection of production traffic, that you can then import into the load-testing tool of your choice.

Given the topic of this post, it's important to note that some companies are working on removing sidecars from the implementation of service meshes. However, right now it's just an option, and it'll be exciting to see whether the future of service meshes is sidecar-less or not.

eBPF is becoming more and more used in modern infrastructure, like Kubernetes. Essentially, eBPF lets you execute applications at the kernel level, and, as such, it’s a very effective way of capturing traffic.

High-fidelity logs from a monitoring system is also a possible way to capture traffic. However, it’s important to note here that it has to be high-fidelity. To save costs and storage space, it’s very common to strip request logs of some headers that won’t be helpful in troubleshooting. If you are intending to use logs for traffic replay and reap the most benefits, you need to keep all the information.

API Gateways/Ingress controllers can possibly be a good way of capturing traffic, as it’s very common for gateways and controllers to log the traffic going through. If traffic capture truly becomes commoditized, it’s then simply a matter of importing the traffic into your load-testing tool.

Traffic Replay Is a Powerful Tool in Load Testing

Sidecars are going to change the way that load testing is done, and some companies are already embracing it. Whether it’s because you’ll be using them for capturing traffic, replaying it, or mocking the outgoing requests, there’s no doubt that you’ll be seeing an increase in the use of sidecars.

If you’re interested in testing what it’s like to load test Kubernetes using sidecars, go ahead and sign up for a free trial for Speedscale. Speedscale is a new pioneering product in Kubernetes load testing, focusing heavily on features like traffic replay, shadowing, API mocks, etc.

The post Sidecars are Changing the Kubernetes Load-Testing Landscape appeared first on The New Stack.

]]>
Confluent: Have We Entered the Age of Streaming? https://thenewstack.io/confluent-have-we-entered-the-age-of-streaming/ Wed, 12 Oct 2022 14:56:19 +0000 https://thenewstack.io/?p=22689396

Three years ago, when we posed the question, “Where is Confluent going?”, the company was still at the unicorn stage.

The post Confluent: Have We Entered the Age of Streaming? appeared first on The New Stack.

]]>

Three years ago, when we posed the question, “Where is Confluent going?”, the company was still at the unicorn stage. But it was at a time when Apache Kafka was emerging as the go-to publish/subscribe messaging engine for the cloud era. At the time, we drew comparisons with Spark, which had a lot more visibility. Since then, Confluent has IPO'ed, while Databricks continues to sit on a huge cash trove as it positions itself as the de facto rival to Snowflake for the Data Lakehouse. Pulsar recently emerged as a competing project, but is it game over? Hyperscalers are offering alternatives like Amazon Kinesis, Azure Event Hubs, and Google Cloud Dataflow, but guess what? AWS co-markets Confluent Cloud, with a similar arrangement with Microsoft Azure on the way.

More to the point, today the question is not so much about Confluent or Kafka, but whether streaming is about to become the norm.

Chicken and Egg

At its first post-pandemic conference event last week, CEO Jay Kreps evangelized streaming using electricity as the metaphor. Kreps positioned streaming as pivotal to the next wave of apps in chicken and egg terms. That is, when electricity was invented, nobody thought of use cases aside from supplanting water or steam power, but then the light bulb came on, literally. And so, aside from generalizations about the importance of real-time information, Kreps asserted that the use cases for streaming will soon become obvious because the world operates in real time, and having immediate visibility is becoming table stakes.

Put another way, in retrospect, conventional wisdom will ask why we didn’t think of real-time to start off with.

There's a certain irony here: streaming is just one piece of Kafka, as the crux of the system is really about PubSub. And the way Apache Kafka is architected, you can plug and play almost any streaming service; you're not limited to Kafka Streams. And as to PubSub, what's really new here? The first PubSub implementation predated the dot com era.

Back in the 1990s, Tibco pioneered a commercial market for technology that capital market firms had previously been inventing on their own. Vivek Ranadive, CEO and founder of Tibco, wrote a book about it titled The Power of Now back in the dot com era. Tibco later refined the message to “The 2-Second Advantage.” The premise was that even if you didn't have all the information, having just enough of it two seconds ahead of everybody else should provide a competitive advantage in real-time use cases like capital markets tickers; any form of network management (telcos and supply chains come to mind); and — this being the 90s — e-commerce.

What Changed?

Kafka reinvented PubSub messaging for massively distributed scale on commodity hardware, and unlike the Tibco and IBM MQ era, was open source. OK, Kafka wasn't the first open source messaging system. But its massive parallelism left RabbitMQ and Java Message Service in the dust. Although not initially conceived for the cloud, Kafka's scale-out, distributed architecture anticipated it. Apache Pulsar notwithstanding, a critical mass ecosystem of commercial support and related open source projects made Kafka the de facto standard for PubSub technology in cloud native environments. As to the old idea of the 2-Second Advantage for making decisions before you have all the information, today's fat pipes and cloud scale make that question academic.
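For readers who have never touched it, the PubSub model at the heart of all this is easy to show: a producer writes records to a topic, and any number of consumer groups each read the full stream independently. A minimal sketch with the confluent-kafka Python client (broker address, topic, and group name are placeholders) might look like this:

```python
# Minimal PubSub sketch with the confluent-kafka Python client. Broker
# address, topic, and consumer group are placeholders.
from confluent_kafka import Consumer, Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})
producer.produce("orders", key="order-123", value=b'{"amount": 42}')
producer.flush()  # block until the broker has acknowledged the message

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "pricing-service",      # each group gets its own copy of the stream
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["orders"])

msg = consumer.poll(5.0)  # wait up to five seconds for a record
if msg is not None and msg.error() is None:
    print(msg.key(), msg.value())
consumer.close()
```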

Bottom line? Today, roughly 80% of Fortune 100 companies use Kafka, according to the Apache Software Foundation.

But We’re Talking about Streaming

While PubSub has been the linchpin for Kafka’s success, the overriding theme for the Confluent Current conference last week was about heralding the age of streaming. Confluent’s view is that streaming will become the new norm for analytics because the world operates in real time, and so should analytics. It brought out use cases from reference customers like Expedia, for travel pricing; the US Postal Service, for handling requests for free COVID tests from its website; and Pinterest, for real-time content recommendations.

But is every company like born-online outfits such as Pinterest or national utilities like the Postal Service? That's redolent of the old questions of whether every company is a dot com, or whether every company needs what we used to term “big data.” Those are metaphors that Confluent would likely not object to: today it's unthinkable for any B2C or B2B company to not have an online presence, not to mention that handling many of the “Vs” of “big data” has become pretty routine with cloud analytics services like Redshift or Snowflake.

Nonetheless, the issue of whether streaming has become the new norm was very much up for discussion. We give a lot of credit to Confluent for making room for diverse viewpoints at its event. One hurdle is a matter of perception. In his session, “Streaming is Still Not the Default,” Decodable CEO Eric Sammer posed the question of why most analytics systems still run in batch. He rifled through rationalizations, such as the claim that specific use cases don't require streaming. His comments were reinforced by Snowflake engineers Dan Sotolongo and Tyler Akidau in their session on Streaming 101.

But dig deeper and there is the fear of the unknown. According to Sammer, streaming technology is still inherently more complex than batch. For now, you still have to orchestrate multiple components like message brokers, stream processing formats, and connectors; it's a problem redolent of the zoo animal management that limited Hadoop adoption. The culprit is not the core Kafka technology but the fact that tooling is disaggregated and too low-level. Architects and developers must juggle low-level parameters such as buffer sizing (which impacts anomaly detection); distinguishing between event time and processing time (to ensure feeds are properly sequenced); and managing transactional consistency with external consistency.
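The event-time versus processing-time distinction, in particular, is easy to illustrate: records can arrive out of order, so windowing by when an event occurred gives different results than windowing by when it arrived. A bare-bones sketch, with invented timestamps:

```python
# Conceptual sketch of event-time windowing: a late-arriving record is still
# credited to the one-minute window in which it actually occurred.
from collections import defaultdict
from datetime import datetime, timedelta

events = [  # (event time, value) -- arrival order differs from event order
    (datetime(2022, 10, 12, 14, 0, 58), 3),
    (datetime(2022, 10, 12, 14, 1, 2), 5),
    (datetime(2022, 10, 12, 14, 0, 59), 7),  # arrives late, belongs to the 14:00 window
]

def window_start(ts, size=timedelta(minutes=1)):
    """Floor a timestamp to the start of its window."""
    return datetime.min + ((ts - datetime.min) // size) * size

by_event_time = defaultdict(int)
for ts, value in events:
    by_event_time[window_start(ts)] += value

# The late record lands in the 14:00 window; a processing-time grouping would
# instead lump it in with whatever window was open when it arrived.
print(dict(by_event_time))
```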

Will Kafka Become Less Kafkaesque?

There's good news on the horizon. The Apache project is replacing ZooKeeper, the same tool for managing distributed configurations from the Hadoop days, with an internal capability called KRaft (based on the Raft consensus protocol) that is now in early access. Sammer said he would like the Apache project to go further by building its own schema registry. We won't argue. Just as databases hide a lot of the primitives under the hood, we'd like to see more of that with Kafka. The burden is on both the open source project and the tooling ecosystem. And we'd like Kafka SaaS providers to offer serverless options.

Confluent's growth has not rested simply on Kafka — if it had, its cloud service would likely not be more than doubling year over year while each hyperscaler offers its own managed Kafka service. Like Databricks, it has focused on building an end-to-end platform that unifies what would otherwise be a complex toolchain. Confluent has built its own implementation of Apache Kafka that abstracts and tiers storage, making the service elastic and more economical than the attached block storage that is otherwise the norm. It has focused on simplifying the user experience for a technology that got its name because the problem was Kafkaesque.

Confluent has made Apache Kafka (which was written for Java) accessible to developers of Python and other languages, offering over a hundred out-of-the-box connectors and its own streaming SQL engine. Announcements made at the conference include a new Stream Designer targeted at the no code/low code crowd and a new, enhanced edition of Stream Governance. The governance enhancements encompass a globally available schema registry; a catalog that now includes business metadata; and point-in-time playback capability for data lineage. In the theme of operational simplification, we would like to see Confluent simplify the setup for observability; for instance, customers should not have to specially set up their own time series database clusters to track performance. Confluent states that, for most customers, this capability would currently be overkill. We say this out of tough love; Confluent has made important strides toward making Kafka enterprise-ready.

Cut to the Chase: Is Streaming Ready for Prime Time?

Confluent, Snowflake, and the hyperscalers have helped demystify and lower the barriers to what we used to term “big data.” Heck, even Cloudera has helped us forget about the Zoo Animals. Yes, there are still operational complexities that are unique to streaming, but then again, we said the same thing about analyzing large volumes and varieties of data. It’s a learning process.

The larger question is whether streaming will cross the chasm or jump the shark with the enterprise. We view analytics as a spectrum of requirements depending on the use case, and the answers aren't necessarily cut and dried. For instance, if much of your business is online, that doesn't necessarily dictate that it must be driven in real time. Probably the biggest example is large, high-value capital goods, where orders are not likely to spike in the same way that a flashy new, long-awaited must-have mobile device might. On the other hand, even if the product or service that your organization delivers isn't subject to minute-by-minute swings, chances are some aspects of your business are. For instance, that large capital goods manufacturer is likely to have rapid, transient outliers in its supply chain that could impact long-term planning.

Our answer to Confluent’s call to action? Streaming is not the shiny new thing, but it is one of the new pieces of the puzzle that most organizations will need to run their business.

The post Confluent: Have We Entered the Age of Streaming? appeared first on The New Stack.

]]>
How Idit Levine’s Athletic Past Fueled Solo.io’s Startup https://thenewstack.io/how-idit-levines-athletic-past-fueled-solo-ios-startup/ Fri, 16 Sep 2022 18:41:11 +0000 https://thenewstack.io/?p=22683723

Idit Levine’s tech journey originated in an unexpected place: a basketball court. As a seventh grader in Israel, playing in

The post How Idit Levine’s Athletic Past Fueled Solo.io’s Startup appeared first on The New Stack.

]]>

Idit Levine’s tech journey originated in an unexpected place: a basketball court. As a seventh grader in Israel, playing in hoops tournaments definitely sparked her competitive side.

“I was basically going to compete with all my international friends for two minutes without parents, without anything,” Levine said. “I think it made me who I am today. It’s really giving you a lot of confidence to teach you how to handle situations … stay calm and still focus.”

Developing that calm and focus proved an asset during Levine’s subsequent career in professional basketball in Israel, and when she later started her own company. In this episode of The Tech Founder Odyssey podcast series, Levine, founder and CEO of Solo.io, an application networking company with a $1 billion valuation, shared her startup story.

The conversation was co-hosted by Colleen Coll and Heather Joslyn of The New Stack.

After finishing school and service in the Israeli Army, Levine was still unsure of what she wanted to do. She noticed her brother and sister’s fascination with computers. Soon enough, she recalled,  “I picked up a book to teach myself how to program.”

It was only a matter of time before she found her true love: the cloud native ecosystem. “It’s so dynamic, there’s always something new coming. So it’s not boring, right? You can assess it, and it’s very innovative.”

Moving from one startup to the next, then on to bigger companies, including Dell EMC, where she was chief technology officer of the cloud management division, Levine was happy seeking experiences that challenged her technically. “And at one point, I said to myself, maybe I should stop looking and create one.”

Learning How to Pitch

Winning support for Solo.io demanded that the former hoops player acquire an unfamiliar skill: how to pitch. Levine’s company started in her current home of Boston, and she found raising money in that environment more of a challenge than it would be in, say, Silicon Valley.

It was difficult to get an introduction without a connection, she said:  “I didn’t understand what pitches even were but I learned how … to tell the story. That helped out a lot.”

Founding Solo.io was not about coming up with an idea to solve a problem at first. “The main thing at Solo.io, and I think this is the biggest point, is that it’s a place for amazing technologists, to deal with technology, and, beyond the top of innovation, figure out how to change the world, honestly,” said Levine.

Even when the focus is software, she believes it’s eventually always about people. “You need to understand what’s driving them and make sure that they’re there, they are happy. And this is true in your own company. But this is also [true] in the ecosystem in general.”

Levine credits the company’s success to its ability to establish amazing relationships with customers – Solo.io has a renewal rate of 98.9% – using a customer engagement model that resembles how users collaborate in the open source community. “We’re working together to build the product.”

Throughout her journey, she has carried the idea of a team: in her early beginnings in basketball, in how she established a “no politics” office culture, and even in the way she involves her family with Solo.io.

As for the ever-elusive work/life balance, Levine called herself a workaholic, but suggested that her journey has prepared her for it:  “I trained really well. Chaos is a part of my personal life.”

She elaborated, “I think that one way to do this is to basically bring the company to [my] personal life. My family was really involved from the beginning and my daughter chose the logos. They’re all very knowledgeable and part of it.”

Like this episode? Here are more from The Tech Founder Odyssey series:

From DB2 to Real-Time with Aerospike Founder Srini Srinivasan

The Stone Ages of Open Source Security

Tina Huang: Curating for SRE Through Lessons Learned at Google News

The post How Idit Levine’s Athletic Past Fueled Solo.io’s Startup appeared first on The New Stack.

]]>