Data Storage Systems Overview | The New Stack (https://thenewstack.io/storage/)

The Architect’s Guide to Storage for AI
https://thenewstack.io/the-architects-guide-to-storage-for-ai/ (June 1, 2023)

Choosing the best storage for all phases of a machine learning (ML) project is critical. Research engineers need to create multiple versions of datasets and experiment with different model architectures. When a model is promoted to production, it must operate efficiently when making predictions on new data.  A well-trained model running in production is what adds AI to an application, so this is the ultimate goal.

This has been my life as an AI/ML architect for the past few years. I wanted to share what I’ve learned about the project requirements for training and serving models and the available options. Consider this a survey. I will cover everything from traditional file system variants to modern cloud native object stores that are designed for the performance-at-scale requirements associated with large language models (LLMs) and other generative AI systems.

Once we understand requirements and storage options, I will review each option to see how it stacks up against our requirements.

Before we dive into requirements, reviewing what is happening in the software industry today will be beneficial. As you will see, the evolution of machine learning and artificial intelligence is driving requirements.

The Current State of ML and AI

Large language models (LLMs) that can chat with near-human precision are dominating the media. These models require massive amounts of data to train. There are many other exciting advances concerning generative AI, for example, text-to-image and sound generation. These also require a lot of data.

It is not just about LLMs. Other model types solve basic line-of-business problems. Regression, classification and multilabel classification are nongenerative model types that add real value to an enterprise. More and more organizations are looking to these types of models to solve a variety of problems.

Another phenomenon is that an increasing number of enterprises are becoming SaaS vendors that offer model-training services using a customer’s private data. Consider an LLM that an engineer trained on data from the internet and several thousand books to answer questions, much like ChatGPT. This LLM would be a generalist capable of answering basic questions on various topics.

However, it might not provide a helpful, detailed and complete answer if a user asks a question requiring advanced knowledge of a specific industry like health care, financial services or professional services. It is possible to fine-tune a trained model with additional data.

So an LLM trained as a generalist can be further trained with industry-specific data. The model would then provide better answers to questions about the specified industry. Fine-tuning is especially beneficial when done with LLMs, as their initial training can cost millions, and the fine-tuning cost is much cheaper.

Regardless of the model you are building, once it is in production, it has to be available, scalable and resilient, just like any other service you deploy as a part of your application. If you are a business offering a SaaS solution, you will have additional security requirements. You will need to prevent direct access to any models that represent your competitive advantage, and you will need to secure your customers’ data.

Let’s look at these requirements in more detail.

Machine Learning Storage Requirements

The storage requirements listed below are from the lens of a technical decision-maker assessing the viability of any piece of technology. Specifically, technologies used in a software solution must be scalable, available, secure, performant, resilient and simple. Let’s see what each requirement means to a machine learning project.

Scalable: Scalability in a storage solution refers to its ability to handle an increasing amount of storage without requiring significant changes. In other words, scalable storage continues to function optimally as capacity and throughput requirements increase. Consider an organization starting its ML/AI journey with a single project. This project by itself may not have large storage requirements. However, other teams will soon create initiatives of their own. Each of these teams may have small storage requirements, but collectively they can place a considerable load on a central storage solution. A scalable storage solution should scale its resources (either out or up) to handle the additional capacity and throughput needed as new teams onboard their data.

Available: Availability is a property that refers to an operational system’s ability to carry out a task. Operations personnel often measure availability for an entire system over time. For example, the system was available for 99.999% of the month. Availability can also refer to the wait time an individual request experiences before a resource can start processing it. Excessive wait times render a system unavailable.

Regardless of the definition, availability is essential for model training and storage. Model training should not experience delays due to lack of availability in a storage solution. A model in production should be available for 99.999% of the month. Requests for data or the model itself, which may be large, should experience low wait times.

Secure: Before all read or write operations, a storage system should know who you are and what you can do. In other words, storage access needs to be authenticated and authorized. Data should also be secure at rest and provide options for encryption. The hypothetical SaaS vendor mentioned in the previous section must pay close attention to security as they provide multitenancy to their customers. The ability to lock data, version data and specify retention policy are also considerations that are part of the security requirement.

Performant: A performant storage solution is optimized for high throughput and low latency. Performance is crucial during model training because higher performance means experiments complete faster. The more experiments an ML engineer can run, the more accurate the final model is likely to be. If a neural network is used, it will take many experiments to determine the optimal architecture. Additionally, hyperparameter tuning requires even further experimentation. Organizations using GPUs must take care to prevent storage from becoming the bottleneck. If a storage solution cannot deliver data at a rate equal to or greater than a GPU’s processing rate, the system will waste precious GPU cycles.

Resilient: A resilient storage solution should not have a single point of failure. A resilient system tries to prevent failure, but when failures occur, it can gracefully recover. Such a solution should be able to participate in failover and stay exercises where the loss of an entire data center is emulated to test the resiliency of a whole application.

Models running in a production environment require resiliency. However, resiliency can also add value to model training. Suppose an ML team uses distributed training techniques that use a cluster. In that case, the storage that serves this cluster, as well as the cluster itself, should be fault tolerant, preventing the team from losing hours or days due to failures.

Simple: Engineers use the words “simple” and “beautiful” almost synonymously, and there is a reason for this. When a software design is simple, it is well thought out. Simple designs fit many different scenarios and solve a lot of problems. A storage system for ML should be simple, especially in the proof of concept (PoC) phase of a new ML project, when researchers need to focus on feature engineering, model architectures and hyperparameter tuning while improving a model until it is accurate enough to add value to the business.

The Storage Landscape

There are several storage options for machine learning and serving. Today, these options fall into the following categories: local file storage, network-attached storage (NAS), storage-area networks (SAN), distributed file systems (DFS) and object storage. In this section, I’ll discuss each and compare them to our requirements. The goal is to find an option that measures up the best across all requirements.

Local file storage: The file system on a researcher’s workstation and the file system on a server dedicated to model serving are examples of local file systems used for ML storage. The underlying device for local storage is typically a solid-state drive (SSD), but it could also be a more advanced nonvolatile memory express drive (NVMe). In both scenarios, compute and storage are on the same system.

This is the simplest option. It is also a common choice during the PoC phase, where a small R&D team attempts to get enough performance out of a model to justify further expenses. While common, there are drawbacks to this approach.

Local file systems have limited storage capacity and are unsuitable for larger datasets. Since there is no replication or autoscaling, a local file system cannot operate in an available, reliable and scalable fashion. They are as secure as the system they are on. Once a model is in production, there are better options than a local file system for model serving.

Network-attached storage (NAS): A NAS is a storage device attached to a TCP/IP network with its own IP address, much like a computer. The underlying technology for storage is a RAID array of drives, and files are delivered to clients over TCP. These devices are often delivered as an appliance: the compute needed to manage the data and the RAID array are packaged into a single device.

NAS devices can be secured, and the RAID configuration of the underlying storage provides some availability and reliability. NAS devices use file-sharing protocols such as Server Message Block (SMB) and Network File System (NFS), which run over TCP, for data transfer.

NAS devices run into scaling problems when there are a large number of files. This is due to the hierarchy and pathing of their underlying storage structure, which maxes out at millions of files. This is a problem with all file-based solutions. Maximum storage for a NAS is on the order of tens of terabytes.

Storage-area network (SAN): A SAN combines servers and RAID storage on a high-speed interconnect. With a SAN, you can put storage traffic on a dedicated Fibre Channel network using the Fibre Channel Protocol (FCP). A request for a file operation may arrive at a SAN via TCP, but all data transfer occurs via a network dedicated to delivering data efficiently. If a dedicated Fibre Channel network is unavailable, a SAN can use Internet Small Computer System Interface (iSCSI), which carries storage traffic over TCP.

A SAN is more complicated to set up than a NAS device since it is a network and not a device. You need a separate dedicated network to get the best performance out of a SAN. Consequently, a SAN is costly and requires considerable effort to administer.

While a SAN may look compelling when compared to a NAS (improved performance and similar levels of security, availability and reliability), it is still a file-based approach with all the problems previously described. The improved performance does not make up for the extra complexity and cost. Total storage maxes out around hundreds of petabytes.

Distributed file system: A distributed file system (DFS) is a file system that spans multiple computers or servers and enables data to be stored and accessed in a distributed manner. Instead of a single centralized system, a distributed file system distributes data across multiple servers or containers, allowing users to access and modify files as if they were on a single, centralized file system.

Some popular examples of distributed file systems include Hadoop Distributed File System (HDFS), Google File System (GFS), Amazon Elastic File System (EFS) and Azure Files.

Files can be secured, like the file-based solutions above, since the operating system is presented with an interface that looks like a traditional file system. Distributed file systems run in a cluster that provides reliability. Running in a cluster may result in better throughput when compared to a SAN; however, they still run into scaling problems when there are a large number of files (like all file-based solutions).

Object storage: Object storage has been around for quite some time but was revolutionized when Amazon made it the first AWS service in 2006 with Simple Storage Service (S3). Modern object storage was born in the cloud, and other clouds soon brought their offerings to market: Microsoft offers Azure Blob Storage, and Google has its Google Cloud Storage service. The S3 API is the de facto standard for developers to interact with storage in the cloud, and multiple companies offer S3-compatible storage for the public cloud, private cloud, edge and co-located environments. Regardless of where an object store is located, it is accessed via a RESTful interface.

The most significant difference with object storage compared to the other storage options is that data is stored in a flat structure. Buckets are used to create logical groupings of objects. Using S3 as an example, a user would first create one or more buckets and then place their objects (files) in one of these buckets. A bucket cannot contain other buckets, and a file must exist in only one bucket. This may seem limiting, but objects have metadata, and using metadata, you can emulate the same level of organization that directories and subdirectories provide within a file system.
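
To make this concrete, here is a minimal sketch in Python using boto3 against an S3-compatible endpoint. The endpoint, credentials, bucket and key names are hypothetical placeholders; the point is simply that slash-delimited key prefixes and per-object metadata can stand in for the directory hierarchy a file system would give you.

```python
import boto3

# Hypothetical endpoint, credentials, bucket and keys for any S3-compatible store.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

s3.create_bucket(Bucket="datasets")

# The key's slash-delimited prefix emulates a directory hierarchy,
# and per-object metadata carries additional organizational detail.
s3.put_object(
    Bucket="datasets",
    Key="images/train/cat_0001.png",
    Body=open("cat_0001.png", "rb"),
    Metadata={"split": "train", "label": "cat", "dataset-version": "v2"},
)

# Listing by prefix behaves like listing a "directory."
response = s3.list_objects_v2(Bucket="datasets", Prefix="images/train/")
for obj in response.get("Contents", []):
    print(obj["Key"])
```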

Object storage solutions also perform best when running as a distributed cluster. This provides them with reliability and availability.

Object stores differentiate themselves when it comes to scale. Due to the flat address space of the underlying storage (every object in only one bucket and no buckets within buckets), object stores can find an object among potentially billions of objects quickly. Additionally, object stores offer near-infinite scale to petabytes and beyond. This makes them perfect for storing datasets and managing large models.

Below is a storage scorecard showing solutions against requirements.

The Best Storage Option for AI

Ultimately, the choice of storage options will be informed by a mix of requirements, reality and necessity; however, for production environments, there is a strong case to be made for object storage.

The reasons are as follows:

  1. Performance at scale: Modern object stores are fast and remain fast even in the face of hundreds of petabytes and concurrent requests. You cannot achieve that with other options.
  2. Unstructured data: Many machine learning datasets are unstructured — audio, video and images. Even tabular ML datasets that could be stored in a database are more easily managed in an object store. For example, it is common for an engineer to treat the thousands or millions of rows that make up a training set as a single entity that can be stored and retrieved via a single simple request (see the sketch after this list). The same is true for validation sets and test sets.
  3. RESTful APIs: RESTful APIs have become the de facto standard for communication between services. Consequently, proven messaging patterns exist for authentication, authorization, security in motion and notifications.
  4. Encryption: If your datasets contain personally identifiable information, your data must be encrypted while at rest.
  5. Cloud native (Kubernetes and containers): A solution that can run its services in containers that are managed by Kubernetes is portable across all the major public clouds. Many enterprises have internal Kubernetes clusters that could run a Kubernetes-native object storage deployment.
  6. Immutable: It’s important for experiments to be repeatable, and they’re not repeatable if the underlying data moves or is overwritten. In addition, protecting training sets and models from deletion, accidental or intentional, will be a core capability of an AI storage system when governments around the world start regulating AI.
  7. Erasure coding vs. RAID for data resiliency and availability: Erasure coding uses simple drives to provide the redundancy required for resilient storage. A RAID array (made up of a controller and multiple drives), on the other hand, is another device type that has to be deployed and managed. Erasure coding works on an object level, while RAID works on a block level. If a single object is corrupted, erasure coding can repair that object and return the system to a fully operational state quickly (as in minutes). RAID would need to rebuild the entire volume before any data can be read or written, and rebuilding can take hours or days, depending on the size of the drive.
  8. As many files as needed: Many large datasets used to train models are created from millions of small files. Imagine an organization with thousands of IoT devices, each taking a measurement every second. If each measurement is a file, then over time, the total number of files will be more than a file system can handle.
  9. Portable across environments: A software-defined object store can use a local file, NAS, SAN and containers running with NVMe drives in a Kubernetes cluster as its underlying storage. Consequently, it is portable across different environments and provides access to underlying storage via the S3 API everywhere.
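
As a small illustration of point 2 above, the sketch below stores and retrieves a training split as a single Parquet object using pandas. It assumes an S3-compatible endpoint and the s3fs package; the bucket, path and credentials are hypothetical.

```python
import pandas as pd

# Hypothetical storage options for an S3-compatible endpoint; the s3fs package
# lets pandas read and write s3:// URLs directly.
storage_options = {
    "key": "ACCESS_KEY",
    "secret": "SECRET_KEY",
    "client_kwargs": {"endpoint_url": "http://localhost:9000"},
}

# Persist the split as one object -- a single request to store it.
train = pd.DataFrame({"feature": [0.1, 0.7, 0.3], "label": [0, 1, 0]})
train.to_parquet("s3://datasets/churn/v2/train.parquet",
                 storage_options=storage_options)

# Later, the whole training set comes back with one call.
train = pd.read_parquet("s3://datasets/churn/v2/train.parquet",
                        storage_options=storage_options)
```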

MinIO for ML Training and Inference

MinIO has become a foundational component in the AI/ML stack for its performance, scale, performance at scale and simplicity. MinIO is ideally configured in a cluster of containers that use NVMe drives; however, you have options to use almost any storage configuration as requirements demand.

An advantage of implementing a software-defined, cloud native approach is that the code becomes portable. Training code and model-serving code do not need to change as an ML project matures from a proof of concept to a funded project, and finally to a model in production serving predictions in a cluster.

While portability and flexibility are important, they are meaningless if they are at the expense of performance. MinIO’s performance characteristics are well known, and all are published as benchmarks.

Next Steps

Researchers starting a new project should install MinIO on their workstations. The following guides will get you started.

If you are responsible for the clusters that make up your development, testing and production environments, then consider adding MinIO.

Learn by Doing

Developers and data scientists increasingly control their own storage environments. The days of IT closely guarding storage access are gone. Developers naturally gravitate toward technologies that are software defined, open source, cloud native and simple. That essentially defines object storage as the solution.

If you are setting up a new workstation with your favorite ML tools, consider adding an object store to your toolkit by installing a local solution. Additionally, if your organization has formal environments for experimentation, development, testing and production, then add an object store to your experimental environment. This is a great way to introduce new technology to all developers in your organization. You can also run experiments using a real application running in this environment. If your experiments are successful, promote your application to dev, test and prod.

8 Real-Time Data Best Practices
https://thenewstack.io/8-real-time-data-best-practices/ (June 1, 2023)

More than 300 million terabytes of data are created every day. The next step in unlocking the value from all that data we’re storing is being able to act on it almost instantaneously.

Real-time data analytics is the pathway to the speed and agility that you need to respond to change. Real-time data also amplifies the challenges of batched data, just continuously, at the terabyte scale.

Then, when you’re making changes or upgrades to real-time environments, “you’re changing the tires on the car, while you’re going down the road,” Ramos Mays, CTO of Semantic-AI, told The New Stack.

How do you know if your organization is ready for that deluge of data and insights? Do you have the power, infrastructure, scalability and standardization to make it happen? Are all of the stakeholders at the planning board? Do you even need real time for all use cases and all datasets?

Before you go all-in on real time, there are a lot of data best practices to evaluate and put in place before committing to that significant cost.

1. Know When to Use Real-Time Data

Just because you can collect real-time data doesn’t mean you always need it. Your first step should be thinking about your specific needs and what sort of data you’ll require to monitor your business activity and make decisions.

Some use cases, like supply chain logistics, rely on real-time data for real-time reactions, while others simply demand a much slower information velocity and only need analysis on historical data.

Most real-time data best practices come down to understanding your use cases up front because, Mays said, “maintaining a real-time infrastructure and organization has costs that come alongside it. You only need it if you have to react in real time.

“I can have a real-time ingestion of traffic patterns every 15 seconds, but, if the system that’s reading those traffic patterns for me only reads it once a day, as a snapshot of only the latest value, then I don’t need real-time 15-second polling.”

Nor, he added, should he need to support the infrastructure to maintain it.

Most companies, like users of Semantic-AI, an enterprise intelligence platform, need a mix of historical and real-time data; Mays’ company, for instance, is selective about when it does and doesn’t opt for using information collected in real time.

He advises bringing together your stakeholders at the start of your machine learning journey and asking: Do we actually need real-time data, or is near-real-time streaming enough? What’s our plan to react to that data?

Often, you just need to react if there’s a change, so you would batch most of your data, and then go for real time only for critical changes.

“With supply chain, you only need real time if you have to respond in real time,” Mays said. “I don’t need real-time weather if I’m just going to do a historic risk score, [but] if I am going to alert there’s a hurricane through the flight path of your next shipment [and] it’s going to be delayed for 48 hours, you’re reacting in real time.”

2. Keep Data as Lightweight as Possible

Next you need to determine which categories of data actually add value by being in real time in order to keep your components lightweight.

“If I’m tracking planes, I don’t want my live data tracking system to have the flight history and when the tires were last changed,” Mays said. “I want as few bits of information as possible in real time. And then I get the rest of the embellishing information by other calls into the system.”

Real-time data must be designed differently than batch, he said. Start thinking about where it will be presented in the end, and then, he recommended, tailor your data and streams to be as close to your display format as possible. This helps determine how the team will respond to changes.

For example, if you’ve got customers and orders, one customer can have multiple orders. “I want to carry just the amount of information in my real-time stream that I need to display to the users,” he said, such as the customer ID and order details. Even then, you will likely only show the last few orders in live storage, and then allow customers to search and pull from the archives.

For risk scoring, a ground transportation algorithm needs real-time earthquake information, while the aviation algorithm needs real-time wind speed — it’s rare that they would both need both.

Whenever possible, Mays added, only record deltas — changes in your real-time data. If your algorithm is training on stock prices, but those only change every 18 seconds, you don’t need it set for every quarter second. There’s no need to store those 72 data points across networks when you could only send one message when the value changes. This in turn reduces your organizational resource requirements and focuses again on the actionable.
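
As a rough illustration of the delta-only approach, the Python sketch below forwards a reading only when its value changes. The publish function is a hypothetical stand-in for whatever transport you actually use (a Kafka producer, a websocket push and so on).

```python
from typing import Optional

def publish(symbol: str, price: float) -> None:
    # Hypothetical stand-in for the real transport (Kafka producer, websocket, etc.).
    print(f"publish {symbol} -> {price}")

_last_seen: dict[str, float] = {}

def on_tick(symbol: str, price: float) -> None:
    """Forward a reading only when its value differs from the last one sent."""
    previous: Optional[float] = _last_seen.get(symbol)
    if previous is None or price != previous:
        _last_seen[symbol] = price
        publish(symbol, price)

# Polling every quarter second still yields only one message per actual change.
for tick in [101.5, 101.5, 101.5, 101.7, 101.7, 101.4]:
    on_tick("ACME", tick)
```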

3. Unclog Your Pipes

Your data can be stored in the RAM of your computer, on disk or in the network pipe. Reading and writing everything to the disk is the slowest. So, Mays recommended, if you’re dealing in real-time systems, stay in memory as much as you can.

“You should design your systems, if at all possible, to only need the amount of data to do its thing so that it can fit in memory,” he said, so your real-time memory isn’t held up in writing and reading to the disk.

“Computer information systems are like plumbing,” Mays said. “Still very mechanical.”

Think of the amount of data as water. The size of pipes determines how much water you can send through. One stream of water may need to split into five places. Your pipes are your network cables or, when inside the machine, the I/O bus that moves the data from RAM memory to the hard disk. The networks are the water company mainlines, while the bus inside acts like the connection between the mainlines and the different rooms.

Most of the time, this plumbing just sits there, waiting to be used. You don’t really think about it until you are filling up your bathtub (RAM). If you have a hot water heater (or a hard drive), it’ll heat up right away; if it’s coming from your water main (disk or networking cable), it takes time to heat up. Either way, when you have finished using the water (data) in your bathtub (RAM), it drains and is gone.

You must have telemetry and monitoring, Mays said, extending the metaphor, because “we also have to do plumbing while the water is flowing a lot of times. And if you have real-time systems and real-time consumers, you have to be able to divert those streams or store them and let it back up or to divert it around a different way,” while fixing it, in order to meet your service-level agreement.

4. Look for Outliers

As senior adviser for the Office of Management, Strategy and Solutions at the U.S. Department of State, Landon Van Dyke oversees the Internet of Things network for the whole department — including, but not limited to, all sensor data, smart metering, air monitoring, and vehicle telematics across offices, embassies and consulates, and residences. Across all resources and monitors, his team exclusively deals in high-frequency, real-time data, maintaining two copies of everything.

He takes a contrary perspective to Mays and shared it with The New Stack. With all data in real time, Van Dyke’s team is able to spot crucial outliers more often, and faster.

“You can probably save a little bit of money if you look at your utility bill at the end of the month,” Van Dyke said, explaining why his team took on its all-real-time strategy to uncover better patterns at a higher frequency. “But it does not give you the fidelity of what was operating at three in the afternoon on a Wednesday, the third week of the month.”

The specificity of energy consumption patterns are necessary to really make a marked difference, he argued. Van Dyke’s team uses that fidelity to identify when things aren’t working or something can be changed or optimized, like when a diplomat is supposed to be away but the energy usage shows that someone has entered their residence without authorization.

“Real-time data provides you an opportunity for an additional security wrapper around facilities, properties and people, because you understand a little bit more about what is normal operations and what is not normal,” he said. “Not normal is usually what gets people’s attention.”

5. Find Your Baseline

“When people see real-time data, they get excited. They’re like, ‘Hey, I could do so much if I understood this was happening!’ So you end up with a lot of use cases upfront,” Van Dyke observed. “But, most of the time, people aren’t thinking on the backend. Well, what are you doing to ensure that use case is fulfilled?”

Without proper planning upfront, he said, teams are prone to just slap on sensors that produce data every few seconds, connecting them to the internet and to a server somewhere, which starts ingesting the data.

“It can get overwhelming to your system real fast,” Van Dyke said. “If you don’t have somebody manning this data 24/7, the value of having it there is diminished.”

It’s not a great use of anyone’s time to pay people to stare at a screen 24 hours a day, so you need to set up alerts. But, in order to do that, you need to identify what an outlier is.

That’s why, he said, you need to first understand your data and set up your baseline, which could take up to six months, or even longer, when data points begin to have more impact on each other, like within building automation systems. You have to manage people’s expectations of the value of machine learning early on.

Once you’ve identified your baseline, you can set the outliers and alerts, and go from there.
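
A minimal sketch of that idea in Python: keep a rolling window as the baseline and flag anything more than a few standard deviations away. The window size, warm-up length, threshold and sample readings here are hypothetical and would need to be tuned to your own data.

```python
from collections import deque
from statistics import mean, stdev

class BaselineAlert:
    """Rolling baseline over a fixed window; flags readings beyond k standard deviations."""

    def __init__(self, window: int = 288, k: float = 3.0):
        self.window = deque(maxlen=window)  # e.g., 288 = one day of 5-minute readings
        self.k = k

    def observe(self, value: float) -> bool:
        is_outlier = False
        if len(self.window) >= 30:  # wait for enough history to form a baseline
            mu, sigma = mean(self.window), stdev(self.window)
            is_outlier = sigma > 0 and abs(value - mu) > self.k * sigma
        self.window.append(value)
        return is_outlier

kwh_readings = [1.1] * 40 + [5.0] + [1.2] * 5   # hypothetical meter stream
detector = BaselineAlert()
for reading in kwh_readings:
    if detector.observe(reading):
        print(f"ALERT: {reading} deviates from baseline")
```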

6. Move to a Real-Time Ready Database

Still, at the beginning of this machine learning journey, Van Dyke says, most machines aren’t set up to handle that massive quantity of data. Real-time data easily overwhelms memory.

“Once you get your backend analysis going, it ends up going through a series of models,” Van Dyke said. “Most of the time, you’ll bring the data in. It needs to get cleaned up. It needs to go through a transformation. It needs to be run through an algorithm for cluster analysis or regression models. And it’s gotta do it on the fly, in real time.”

As you move from batch processing to real-time data, he continued, you quickly realize your existing system is not able to accomplish the same activities at a two- to five-second cadence. This inevitably leads to more delays, as your team has to migrate to a faster backend system that’s set up to work in real time.

This is why the department moved over to Kinetica’s real-time analytics database, which, Van Dyke said, has the speed built in to handle running these backend analyses on a series of models, ingesting and cleaning up data, and providing analytics. “Whereas a lot of the other systems out there, they’re just not built for that,” he added. “And they can be easily overwhelmed with real-time data.”

7. Use Standardization to Work with Non-Tech Colleagues

What’s needed now won’t necessarily be in demand in the next five years, Van Dyke predicted.

“For right now, where the industry is, if you really want to do some hardcore analytics, you’re still going to want people that know the coding, and they’re still going to want to have a platform where they can do coding on,” he said. “And Kinetica can do that.”

He sees a lot of graphical user interfaces cropping up and predicts ownership and understanding of analytics will soon shift to becoming a more cross-functional collaboration. For instance, the subject matter expert (SME) for building analytics may now be the facilities manager, not someone trained in how to code. For now, these knowledge gaps are closed by a lot of handholding between data scientists and SMEs.

Standardization is essential among all stakeholders. Since everything real time is done at a greater scale, you need to know what your format, indexing and keys are well in advance of going down that rabbit hole.

This standardization is no simple feat in an organization as distributed as the U.S. State Department. However, its solution can be mimicked in most organizations — finance teams are most likely to already have a cross-organizational footprint in place. The State Department controls the master dataset, indexes and metadata for naming conventions and domestic agencies, standardizing them across the government based on the Treasury’s codes.

Van Dyke’s team ensured via logical standardization that “no other federal agency should be able to have its own unique code on U.S. embassies and consulates.”

8. Back Up in Real Time

As previously mentioned, the State Department also splits its data into two streams — one for model building and one for archival back-up. This still isn’t common practice in most real-time data-driven organizations, Van Dyke said, but it follows the control versus variable rule of the scientific method.

“You can always go back to that raw data and run the same algorithms that your real-time one is doing — for provenance,” he said. “I can recreate any outcome that my modeling has done, because I have the archived data on the side.” The State Department also uses the archived data for forensics, like finding patterns of motion around the building and then flagging deviations.

Yes, this potentially doubles the cost, but data storage is relatively inexpensive these days, he said.

The department also standardizes ways to reduce metadata repetition. For example, if a team wants to capture the speed of a fan in a building, the metadata for each reading would include the fan’s make and model and the firmware for the fan controller. Van Dyke’s team dramatically reduces this repetitive data by leveraging JSON to create nested arrays, which lets the team decrease the amount of data by associating one firmware record with all of the speed logs.
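
The sketch below illustrates the general idea in Python with hypothetical field names: the flat form repeats the fan’s make, model and firmware on every reading, while the nested form stores that metadata once alongside an array of readings.

```python
import json

# Flat form: the fan's make, model and firmware repeat on every speed reading.
flat = [
    {"fan": "AHU-7", "make": "Acme", "model": "FX200", "firmware": "3.1.4",
     "ts": "2023-05-01T12:00:00Z", "rpm": 1180},
    {"fan": "AHU-7", "make": "Acme", "model": "FX200", "firmware": "3.1.4",
     "ts": "2023-05-01T12:00:05Z", "rpm": 1182},
]

# Nested form: the metadata is stored once, with an array of readings under it.
nested = {
    "fan": "AHU-7",
    "make": "Acme",
    "model": "FX200",
    "firmware": "3.1.4",
    "readings": [
        {"ts": "2023-05-01T12:00:00Z", "rpm": 1180},
        {"ts": "2023-05-01T12:00:05Z", "rpm": 1182},
    ],
}

print(len(json.dumps(flat)), "bytes flat vs", len(json.dumps(nested)), "bytes nested")
```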

It’s not just for real time, but in general, Van Dyke said: “You have to know your naming conventions, know your data, know who your stakeholders are, across silos. Make sure you have all the right people in the room from the beginning.”

Data is a socio-technical game, he noted. “The people that have produced the data are always protective of it. Mostly because they don’t want the data to be misinterpreted. They don’t want it to be misused. And sometimes they don’t want people to realize how many holes are in the data or how incomplete the data [is]. Either way, people have become very protective of their data. And you need to have them at the table at the very beginning.”

In the end, real-time data best practices rely on collaboration across stakeholders and a whole lot of planning upfront.

Raft Native: The Foundation for Streaming Data’s Best Future
https://thenewstack.io/raft-native-the-foundation-for-streaming-datas-best-future/ (May 30, 2023)

Consensus is fundamental to consistent, distributed systems. To guarantee system availability in the event of inevitable crashes, systems need a way to ensure each node in the cluster is in alignment, such that work can seamlessly transition between nodes in the case of failures.

Consensus protocols such as Paxos, Raft, View Stamped Replication (VSR), etc. help to drive resiliency for distributed systems by providing the logic for processes like leader election, atomic configuration changes, synchronization and more.

As with all design elements, the different approaches to distributed consensus offer different tradeoffs. Paxos is the oldest consensus protocol around and is used in many systems like Google Spanner, Apache Cassandra, Amazon DynamoDB and Neo4j.

Paxos achieves consensus in a three-phased, leaderless, majority-wins protocol. While Paxos is effective in driving correctness, it is notoriously difficult to understand, implement and reason about. This is partly because it obscures many of the challenges in reaching consensus (such as leader election, and reconfiguration), making it difficult to decompose into subproblems.

Raft (for reliable, replicated, redundant and fault-tolerant) can be thought of as an evolution of Paxos — focused on understandability. This is because Raft can achieve the same correctness as Paxos but is more understandable and simpler to implement in the real world, so often it can provide greater reliability guarantees.

For example, Raft uses a stable form of leadership, which simplifies replication log management. And its leader election process, driven through an elegant “heartbeat” system, is more compatible with the Kafka-producer model of pushing data to the partition leader, making it a natural fit for streaming data systems like Redpanda. More on this later.

Because Raft decomposes the different logical components of the consensus problem, for example by making leader election a distinct step before replication, it is a flexible protocol to adapt for complex, modern distributed systems that need to maintain correctness and performance while scaling to petabytes of throughput, all while remaining simpler for new engineers hacking on the codebase to understand.

For these reasons, Raft has been rapidly adopted for today’s distributed and cloud native systems like MongoDB, CockroachDB, TiDB and Redpanda to achieve greater performance and transactional efficiency.

How Redpanda Implements Raft Natively to Accelerate Streaming Data

When Redpanda founder Alex Gallego determined that the world needed a new streaming data platform — to support the kind of gigabytes-per-second workloads that bring Apache Kafka to a crawl without major hardware investments — he decided to rewrite Kafka from the ground up.

The requirements for what would become Redpanda were: 1) it needed to be simple and lightweight to reduce the complexity and inefficiency of running Kafka clusters reliably at scale; 2) it needed to maximize the performance of modern hardware to provide low latency for large workloads; and 3) it needed to guarantee data safety even for very large throughputs.

The initial design for Redpanda used chain replication: Data is produced to node A, then replicated from A to B, B to C and so on. This was helpful in supporting throughput, but fell short for latency and performance, due to the inefficiencies of chain reconfiguration in the event of node downtime (say B crashes: Do you fail the write? Does A try to write to C?). It was also unnecessarily complex, as it would require an additional process to supervise the nodes and push reconfigurations to a quorum system.

Ultimately, Alex decided on Raft as the foundation for Redpanda consensus and replication, due to its understandability and strong leadership. Raft satisfied all of Redpanda’s high-level design requirements:

  • Simplicity. Every Redpanda partition is a Raft group, so everything in the platform is reasoning around Raft, including both metadata management and partition replication. This contrasts with the complexity of Kafka, where data replication is handled by ISR (in-sync replicas) and metadata management is handled by ZooKeeper (or KRaft), and you have two systems that must reason with one another.
  • Performance. The Redpanda Raft implementation can tolerate disturbances to a minority of replicas, so long as the leader and a majority of replicas are stable. In cases when a minority of replicas have a delayed response, the leader does not have to wait for their responses to progress, mitigating impact on latency. Redpanda is therefore more fault-tolerant and can deliver predictable performance at scale.
  • Reliability. When Redpanda ingests events, they are written to a topic partition and appended to a log file on disk. Every topic partition then forms a Raft consensus group, consisting of a leader plus a number of followers, as specified by the topic’s replication factor. A Redpanda Raft group can tolerate ƒ failures given 2ƒ+1 nodes; for example, in a cluster with five nodes and a topic with a replication factor of five, two nodes can fail and the topic will remain operational. Redpanda leverages the Raft joint consensus protocol to provide consistency even during reconfiguration.

Redpanda also extends core Raft functionality in some critical ways to achieve the scalability, reliability and speed required of a modern, cloud native solution. Redpanda enhancements to Raft tend to focus on Day 2 operations, for instance how to ensure the system runs reliably at scale. These innovations include changes to the election process, heartbeat generation and, critically, support for Apache Kafka acks.

Redpanda’s optimistic implementation of Raft is what enables it to be significantly faster than Kafka while still guaranteeing data safety. In fact, Jepsen testing has verified that Redpanda is a safe system without known consistency problems and a solid Raft-based consensus layer.

But What about KRaft?

While Redpanda takes a Raft-native approach, the legacy streaming data platforms have been laggards in adopting modern approaches to consensus. Kafka itself is a replicated distributed log, but it has historically relied on yet another replicated distributed log — Apache ZooKeeper — for metadata management and controller election.

This has been problematic for a few reasons: 1) Managing multiple systems introduces administrative burden; 2) Scalability is limited due to inefficient metadata handling and double caching; 3) Clusters can become very bloated and resource intensive — in fact, it is not too uncommon to see clusters with equal numbers of ZooKeeper and Kafka nodes.

These limitations have not gone unacknowledged by Apache Kafka’s committers and maintainers, who are in the process of replacing ZooKeeper with a self-managed metadata quorum: Kafka Raft (KRaft).

This event-based flavor of Raft achieves metadata consensus via an event log, called a metadata topic, that improves recovery time and stability. KRaft is a positive development for the upstream Apache Kafka project because it helps alleviate pains around partition scalability and generally reduces the administrative challenges of Kafka metadata management.

Unfortunately, KRaft does not solve the problem of having two different systems for consensus in a Kafka cluster. In the new KRaft paradigm, KRaft partitions handle metadata and cluster management, but replication is handled by the brokers using ISR, so you still have these two distinct platforms and the inefficiencies that arise from that inherent complexity.

The engineers behind KRaft are upfront about these limitations, although some exaggerated vendor pronouncements have created ambiguity around the issue, suggesting that KRaft is far more transformative.

Combining Raft with Performance Engineering: A New Standard for Streaming Data

As data industry leaders like CockroachDB, MongoDB, Neo4j and TiDB have demonstrated, Raft-based systems deliver simpler, faster and more reliable distributed data environments. Raft is becoming the standard consensus protocol for today’s distributed data systems because it marries particularly well with performance engineering to further boost the throughput of data processing.

For example, Redpanda combines Raft with speedy architectural ingredients to perform at least 10 times faster than Kafka at tail latencies (p99.99) when processing a 1GBps workload, on one-third the hardware, without compromising data safety.

Traditionally, GBps+ workloads have been a burden for Apache Kafka, but Redpanda can support them with double-digit millisecond latencies, while retaining Jepsen-verified reliability. How is this achieved? Redpanda is written in C++, and uses a thread-per-core architecture to squeeze the most performance out of modern chips and network cards. These elements work together to elevate the value of Raft for a distributed streaming data platform.

[Chart: Redpanda vs. Kafka with KRaft performance benchmark, May 11, 2023]

An example of this in terms of Redpanda internals: Because Redpanda bypasses the page cache and the Java virtual machine (JVM) dependency of Kafka, it can embed hardware-level knowledge into its Raft implementation.

Typically, every time you write in Raft you have to flush to guarantee the durability of writes on disk. In Redpanda’s approach to Raft, smaller intermittent flushes are dropped in favor of a larger flush at the end of a call. While this introduces some additional latency per call, it reduces overall system latency and increases overall throughput, because it is reducing the total number of flush operations.
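
The snippet below is a conceptual sketch of that trade-off in Python, not Redpanda’s actual C++ implementation (which also bypasses the OS page cache): records are appended to a log file and durability is guaranteed with a single fsync per batch rather than one per record.

```python
import os

class BatchedLog:
    """Conceptual sketch: buffer appends and fsync once per call, not once per record."""

    def __init__(self, path: str):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)

    def append_batch(self, records: list[bytes]) -> None:
        for record in records:
            os.write(self.fd, record + b"\n")   # lands in kernel buffers, not yet durable
        os.fsync(self.fd)                       # one flush makes the whole batch durable

log = BatchedLog("raft.log")
log.append_batch([b"set x=1", b"set x=2", b"set x=3"])  # three records, one fsync
```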

While there are many effective ways to ensure consistency and safety in distributed systems (blockchains do it very well with Proof of Work and Proof of Stake protocols), Raft is a proven approach and flexible enough that it can be enhanced to adapt to new challenges.

As we enter a new world of data-driven possibilities, driven in part by AI and machine learning use cases, the future is in the hands of developers who can harness real-time data streams. Raft-based systems, combined with performance-engineered elements like C++ and thread-per-core architecture, are driving the future of data streaming for mission-critical applications.

Why the Document Model Is More Cost-Efficient Than RDBMS
https://thenewstack.io/why-the-document-model-is-more-cost-efficient-than-rdbms/ (May 25, 2023)

A relational database management system (RDBMS) is great at answering random questions. In fact, that is why it was invented. A normalized data model represents the lowest common denominator for data. It is agnostic to all access patterns and optimized for none.

The mission of the IBM System R team, creators of arguably the first RDBMS, was to enable users to query their data without having to write complex code requiring detailed knowledge of how their data is physically stored. Edgar Codd, inventor of the RDBMS, made this claim in the opening line of his famous document, “A Relational Model of Data for Large Shared Data Banks”:

Future users of large data banks must be protected from having to know how the data is organized in the machine.

The need to support online analytical processing (OLAP) workloads drove this reasoning. Users sometimes need to ask new questions or run complex reports on their data. Before the RDBMS existed, this required software engineering skills and a significant time investment to write the code required to query data stored in a legacy hierarchical management system (HMS). RDBMS increased the velocity of information availability, promising accelerated growth and reduced time to market for new solutions.

The cost of this data flexibility, however, was significant. Critics of the RDBMS quickly pointed out that the time complexity, or the time required to query a normalized data model, was very high compared to HMS. As such, it was probably unsuitable for the high-velocity online transaction processing (OLTP) workloads that consume 90% of IT infrastructure. Codd himself recognized the tradeoffs; the time complexity of normalization is also referred to in his paper on the subject:

“If the strong redundancies in the named set are directly reflected in strong redundancies in the stored set (or if other strong redundancies are introduced into the stored set), then, generally speaking, extra storage space and update time are consumed with a potential drop in query time for some queries and in load on the central processing units.”

This would probably have killed the RDBMS before the concept went beyond prototype if not for Moore’s law. As processor efficiency increased, the perceived cost of the RDBMS decreased. Running OLTP workloads on top of normalized data eventually became feasible from a total cost of ownership (TCO) perspective, and from 1980 to 1985, RDBMS platforms were crowned as the preferred solution for most new enterprise workloads.

As it turns out, Moore’s law is actually a financial equation rather than a physical law. As long as the market will bear the cost of doubling transistor density every two years, it remains valid.

Unfortunately for RDBMS technology, that ceased to be the case around 2013, when the cost of moving to 5-nanometer fabrication for server CPUs proved to be a near-insurmountable barrier to demand. The mobile market adopted 5nm technology to use as a loss leader, recouping the cost through years of subscription services associated with the mobile device.

However, there was no subscription revenue driver in the server processing space. As a result, manufacturers have been unable to ramp up 5nm CPU production and per-core server CPU performance has been flattening for almost a decade.

Last February, AMD announced that it is decreasing 5nm wafer inventory indefinitely going forward in response to weak demand for server CPUs due to high cost. The reality is that server CPU efficiency might not see another order-of-magnitude improvement without a generational technology shift, which could take years to bring to market.

All this is happening while storage cost continues to plummet. Normalized data models used by RDBMS solutions rely on cheap CPU cycles to enable efficient solutions. NoSQL solutions rely on efficient data models to minimize the amount of CPU required to execute common queries. Oftentimes this is accomplished by denormalizing the data, essentially trading CPU for storage. NoSQL solutions become more and more attractive as CPU efficiency flattens while storage costs continue to fall.

The gap between RDBMS and NoSQL has been widening for almost a decade. Fortune 10 companies like Amazon have run the numbers and gone all-in with a NoSQL-first development strategy for all mission-critical services.

A common objection from customers before they try a NoSQL database like MongoDB Atlas is that their developers already know how to use RDBMS, so it is easy for them to “stay the course.” Believe me when I say that nothing is easier than storing your data the way your application actually uses it.

A proper document data model mirrors the objects that the application uses. It stores data using the same data structures already defined in the application code, in containers that mimic the way the data is actually processed. There is no abstraction layer between the application and the physical storage, and no added time complexity in the query. The result is less CPU time spent processing the queries that matter.
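
Here is a minimal sketch of that idea using PyMongo; the connection string, database, collection and field names are hypothetical. The document stored is the same shape as the object the application already passes around, and the hot-path query is a single key lookup.

```python
from pymongo import MongoClient

# Hypothetical connection string and names; the stored document mirrors the
# object the application already works with.
client = MongoClient("mongodb://localhost:27017")
customers = client["shop"]["customers"]

customer = {
    "_id": "cust-1001",
    "name": "Ada Lovelace",
    "email": "ada@example.com",
    "recent_orders": [                      # embedded, no join required
        {"order_id": "o-1", "total": 42.50, "items": ["widget", "gasket"]},
        {"order_id": "o-2", "total": 17.25, "items": ["bolt"]},
    ],
}

customers.insert_one(customer)

# The high-velocity query ("show this customer and their recent orders")
# is a single key lookup, with no time spent reassembling normalized rows.
doc = customers.find_one({"_id": "cust-1001"})
print(doc["recent_orders"])
```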

One might say this sounds a bit like hard-coding data structures into storage like the HMS systems of yesteryear. So what about those OLAP queries that RDBMS was designed to support?

MongoDB has always invested in APIs that allow users to run the ad hoc queries required by common enterprise workloads. The recent addition of an SQL-92 compatible API means that Atlas users can now run the business reports they need using the same tooling they have always used when connecting to MongoDB Atlas, just like any other RDBMS platform via ODBC (Open Database Connectivity).

Complex SQL queries are expensive. Running them at high velocity means hooking up a firehose to the capex budget. NoSQL databases avoid this problem by optimizing the data model for the high velocity queries. These are the ones that matter. The impact of this design choice is felt when running OLAP queries that will always be less efficient when executed on denormalized data.

The good news is nobody really cares if the daily report that used to take five seconds to run now takes 10. It only runs once a day. Similarly, the data analyst or support engineer running an ad hoc query to answer a question will never notice if they get a result in 10 milliseconds vs. 100ms. The fact is, OLAP query performance almost never matters; we just need to be able to get answers.

MongoDB leverages the document data model and the Atlas Developer Data Platform to provide high OLTP performance while also supporting the vast majority of OLAP workloads.

Amazon Aurora vs. Redshift: What You Need to Know
https://thenewstack.io/amazon-aurora-vs-redshift-what-you-need-to-know/ (May 25, 2023)

Companies have an ever-expanding amount of data to sort through, analyze and manage. Fortunately, Amazon Web Services (AWS) provides powerful tools and services to help you manage data at scale, including Amazon Aurora and Amazon Redshift. But which service is best for you? Choosing a winner in the Aurora vs. Redshift debate requires careful consideration of each service’s strengths and limitations — and your business needs.

Learn more about each service’s benefits and what makes them different, as well as how to choose the service that’s best for your use cases.

Overview of Amazon Aurora

Amazon Aurora is a proprietary relational database management system developed by AWS. It’s a fully managed MySQL and PostgreSQL-compatible database engine that combines the speed and availability of high-end commercial databases with the simplicity and cost-effectiveness of open source databases.

Aurora provides businesses with a secure and reliable database engine that meets the needs of modern applications. It’s highly available and offers up to five times the throughput of standard MySQL and three times the throughput of standard PostgreSQL databases.

Aurora also offers a low-cost pricing structure, enterprise-grade security features and the ability to scale with ease. With this service, businesses can build and maintain sophisticated applications that require high performance, scalability and availability.

Overview of Amazon Redshift

Amazon Redshift is a petabyte-scale data warehouse service that can store and analyze vast amounts of data quickly and easily. Businesses use it to store and analyze large datasets in the cloud, enabling them to make better use of their data and gain deeper insights.

Redshift’s features make it well-suited for large-scale data-warehousing needs. It enables companies to analyze customer behavior, track sales performance and process log data, all of which are essential components of a data pipeline.

Redshift supports structured, semi-structured and unstructured data. It also provides advanced data compression and encoding capabilities to help businesses optimize storage and query performance. The service integrates with many data visualization and business intelligence tools.

Amazon Aurora vs. Redshift: Key Differences

The choice between Redshift and Aurora requires understanding how these powerful services differ, especially if you’re moving software-as-a-service platforms to AWS. Here are some distinctions between their key features and potential use cases.

OLTP vs. OLAP

One of the main differences between Aurora and Redshift is the type of workloads they’re designed for. Aurora is optimized for online transaction processing (OLTP) workloads, which involve processing small, fast transactions in real time. OLTP systems are typically used for operational tasks, such as inserting, updating and deleting data in a database. They’re designed to support high-volume, low-latency transactions. Data is stored in a normalized format to avoid redundancy.

Redshift is optimized for online analytical processing (OLAP) workloads, which involve processing complex, large-scale queries that require aggregation and analysis from multiple data sources. OLAP systems are typically used for data analysis and reporting tasks, such as generating sales reports or analyzing customer behavior. They’re designed to support complex queries that involve large amounts of data. Data is stored in a denormalized format to improve query performance.
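
The contrast can be sketched with psycopg2, since both Aurora PostgreSQL and Redshift speak the PostgreSQL wire protocol. The endpoints, credentials, tables and columns below are hypothetical; the point is the shape of the work: a small write transaction on one side, a large aggregation on the other.

```python
import psycopg2

# OLTP on Aurora: a small, low-latency transaction against normalized tables.
with psycopg2.connect(host="my-aurora-cluster.example.com",
                      dbname="shop", user="app", password="...") as conn:
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO orders (customer_id, total, created_at) "
            "VALUES (%s, %s, now())",
            (1001, 42.50),
        )
    # leaving the block commits the transaction

# OLAP on Redshift: a large aggregation over denormalized, columnar data.
with psycopg2.connect(host="my-warehouse.example.redshift.amazonaws.com",
                      port=5439, dbname="analytics",
                      user="analyst", password="...") as conn:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT region, date_trunc('month', order_date) AS month, "
            "       sum(total) AS revenue "
            "FROM sales_fact GROUP BY 1, 2 ORDER BY 1, 2"
        )
        for row in cur.fetchall():
            print(row)
```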

Data Models and Storage

Amazon Aurora is a relational database engine that stores data in tables with rows and columns. This makes it ideal for storing transactional data, such as customer orders, inventory and financial records. For example, an e-commerce business that needs to house and analyze customer order data could be a great fit for Aurora.

Amazon Redshift is a columnar data store that’s optimized for analytical queries involving large amounts of data. This includes business intelligence reporting, data warehousing and data exploration. For example, a media company that needs to analyze advertisement impressions and engagement data to optimize an ad campaign would be well served by Redshift.

Aside from the differences in data models, Aurora and Redshift also differ in their approach to data storage. Aurora uses a distributed storage model where data is stored across multiple nodes in a cluster. Meanwhile, Redshift uses a massively parallel processing (MPP) architecture, where data is divided into multiple slices and distributed across different nodes. This allows for faster data retrieval and processing, as each slice can be processed in parallel.

Performance

Aurora is optimized for transactional workloads, and it’s capable of delivering high performance for small, fast transactions that require low latency. This means it’s great for situations where fast response times are critical, such as online transactions or real-time data processing.

Redshift is optimized for analytical workloads and excels at processing complex queries that involve aggregating and analyzing large amounts of data from multiple sources. The columnar storage format used by Redshift enables efficient compression and fast query execution, while the MPP architecture enables high performance even with large datasets.

Scale

Both services are designed to scale horizontally, meaning you can add more nodes to increase processing power and storage capacity. Aurora also enables vertical scaling by upgrading instance types, while Redshift offers a concurrency scaling feature that adds transient capacity on demand to support a virtually unlimited number of concurrent users and queries.

Amazon Aurora can scale up to 128 tebibytes of storage, depending on the engine, and up to 15 read replicas to handle high read traffic. Redshift can scale up to petabyte-scale data warehouses. Redshift also offers automatic scaling and workload management features, allowing you to easily add or remove nodes to handle changing workloads. It also provides local storage on each node, which reduces network traffic and improves performance.
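As a rough illustration of what that scaling looks like in practice, the sketch below uses boto3 to add a reader instance to an existing Aurora cluster and to elastically resize a Redshift cluster. The identifiers, instance class and node count are hypothetical, and the parameters you actually need depend on your setup.

```python
import boto3

rds = boto3.client("rds")
redshift = boto3.client("redshift")

# Aurora: scale reads horizontally by adding another reader instance
# to an existing cluster (identifiers below are hypothetical).
rds.create_db_instance(
    DBInstanceIdentifier="my-aurora-reader-2",
    DBClusterIdentifier="my-aurora-cluster",
    Engine="aurora-postgresql",
    DBInstanceClass="db.r6g.large",
)

# Redshift: elastic resize to add compute nodes for heavier analytical load.
redshift.resize_cluster(
    ClusterIdentifier="my-redshift-cluster",
    NumberOfNodes=4,
    Classic=False,  # elastic resize rather than classic resize
)
```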

Pricing

Aurora’s pricing is based on the amount of data stored, the compute and memory used, and the number of I/O requests made. The pricing model for Redshift is more complex, as it includes the type of instance that you choose, the number of compute nodes, the amount of storage and the duration of usage.

Differences between Aurora and Redshift:

  • OLTP vs. OLAP
  • Data models and storage
  • Performance
  • Scale
  • Pricing

Amazon Aurora vs. Redshift: Shared Benefits

Amazon Aurora and Redshift are each designed to help businesses manage their data in a secure and compliant manner. Here are some of the benefits they have in common.

High Performance

Amazon Aurora databases have a distributed architecture that delivers high performance and availability for transactional workloads. Unlike traditional databases, which can experience performance issues when scaling, Aurora’s architecture is specifically designed to scale out and maintain performance.

Amazon Redshift’s MPP architecture allows it to distribute processing tasks across multiple nodes to handle petabyte-scale data warehouses with ease. It can process large queries quickly, making it a great choice for analytical workloads. Additionally, Redshift provides advanced performance-tuning options, such as automatic query optimization and workload management.

Scalability

Amazon Aurora provides automatic scaling capabilities for compute and storage resources. Using this service means you can easily increase or decrease the number of database instances to meet changes in demand. Aurora also supports read replicas, allowing you to offload read queries from the primary instance to one or more replicas, which improves performance and enables horizontal scale reads.

Amazon Redshift provides elastic scaling, which allows you to easily add or remove compute nodes as your data warehouse grows, all while handling more concurrent queries and improving query performance. Redshift provides automatic distribution and load balancing of data across nodes, which improves performance and scalability.

Cost-Effective

With Amazon Aurora and Redshift, you only pay for the resources you use, allowing you to scale up or down without additional financial concerns.

Amazon Aurora offers a low-cost pricing structure compared to traditional commercial databases, with no upfront costs or long-term commitments. Redshift offers reserved instance pricing options, which saves money if you can commit to using the service for a longer period.

Security

As part of the AWS portfolio, Redshift and Aurora share valuable security features such as multifactor authentication, automated backups and encryption for data at rest and in transit.

You can control who can access your resources using AWS Identity and Access Management (IAM). Permissions can be customized for specific users, groups or roles while controlling access at the network level. You can set up security groups to allow traffic only from trusted sources and configure access control policies to ensure that users are granted the appropriate level of access.
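As a sketch of that kind of scoping, the snippet below uses boto3 to create an IAM policy granting a reporting role temporary Redshift credentials for one cluster and IAM database authentication for one Aurora database user. The account ID, ARNs and names are hypothetical; a real policy would be tailored to your account and compliance requirements.

```python
import json
import boto3

iam = boto3.client("iam")

# Hypothetical least-privilege policy: temporary Redshift credentials for one
# cluster plus IAM database authentication for one Aurora database user.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["redshift:GetClusterCredentials"],
            "Resource": "arn:aws:redshift:us-east-1:123456789012:dbuser:my-redshift-cluster/analyst",
        },
        {
            "Effect": "Allow",
            "Action": ["rds-db:connect"],
            "Resource": "arn:aws:rds-db:us-east-1:123456789012:dbuser:cluster-ABC123DEFG/app_reader",
        },
    ],
}

iam.create_policy(
    PolicyName="analytics-db-access",
    PolicyDocument=json.dumps(policy_document),
)
```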

Benefits of Amazon Aurora and Redshift:

  • High Performance
  • Scalability
  • Cost-Effective
  • Security

Amazon Redshift vs. Aurora: Cost Comparisons

Storage Costs

There are two key factors to consider when assessing storage costs: per-gigabyte pricing and reserved node fees. Both services offer per-gigabyte pricing, which means you pay for what you store rather than for access. Redshift additionally offers reserved nodes, which you commit to for a fixed term in exchange for predictable capacity at a lower rate than on-demand pricing.

Despite this fee structure, Redshift can be a more cost-effective option for larger datasets because of its advanced compression algorithms and columnar storage technologies. These technologies enable Redshift to store and query large amounts of data efficiently, resulting in lower storage costs and faster query performance.

Compute Costs

Compute costs include I/O operations, such as scanning or sorting large datasets, among other elements. Redshift’s advanced compression algorithms and columnar storage improve the efficiency of processing complex analytics, which reduces the number of I/O operations required and results in lower compute costs.

Disk I/O latency, or the time it takes for the storage system to retrieve data, can also affect compute costs. Aurora generally has lower latency compared to Redshift because of its architecture. Query performance is another consideration. Redshift can handle more complex queries than Aurora and may be required for businesses with complex analytical needs, although higher compute costs are possible.

Generally speaking, Aurora is better suited for read and write operations, while Redshift offers superior capabilities for complex analytics processing, which can make investing in larger Redshift cluster sizes worthwhile despite the higher compute costs.

Data Transfer Fees

In addition to storage and compute expenses, factor in data transfer fees that might apply. Data transfers within AWS regions are free of charge, but transferring between regions has associated costs depending on how much data needs to be transmitted.

Amazon Aurora vs. Redshift: What’s Right for You?

Businesses evaluating Aurora, Redshift and similar services need to understand how these offerings fit their workload requirements. Aurora’s advantages include sub-second-latency read/write operations, making it well suited for tasks such as online transaction processing, and its multi-availability-zone deployment capabilities offer high availability and fault tolerance. But Aurora does have limitations: capabilities such as sharding and table partitioning may be constrained, and a single cluster can only grow so large.

Amazon Redshift shines when executing complex analytical queries on vast datasets, making it perfect for data-warehousing duties like business intelligence or analytics. Its columnar storage, data compression and high-performance capabilities allow it to handle intense workloads. Redshift’s scalability options are limited, which could be a disadvantage for large cloud solutions on AWS.

Besides the features and pricing, consider what kind of implementation your business requires. Amazon Aurora offers quick setup times compared with on-premises solutions, while planning, sizing and deploying a production Redshift data warehouse typically takes longer.

After you’ve determined the right solution, you’ll need to migrate data to the new database. Amazon offers several migration tools to help you transfer your data, including AWS Database Migration Service (AWS DMS).

Throughout this process, partnering with a premier tier services partner such as Mission Cloud can ensure a smooth and successful migration. Want to learn more? See how Amazon DMS makes database migration easy.

The post Amazon Aurora vs. Redshift: What You Need to Know appeared first on The New Stack.

Boost DevOps Maturity with a Data Lakehouse https://thenewstack.io/boost-devops-maturity-with-a-data-lakehouse/ Wed, 17 May 2023 17:20:53 +0000 https://thenewstack.io/?p=22708319


In a world riven by macroeconomic uncertainty, businesses increasingly turn to data-driven decision-making to stay agile.

That’s especially true of the DevOps teams tasked with driving digital-fueled sustainable growth. They’re unleashing the power of cloud-based analytics on large data sets to unlock the insights they and the business need to make smarter decisions. From a technical perspective, however, that’s challenging. Observability and security data volumes are growing all the time, making it harder to orchestrate, process, analyze and turn information into insight. Cost and capacity constraints are becoming a significant burden to overcome.

Data Scale and Silos Present Challenges

DevOps teams are often thwarted in their efforts to drive better data-driven decisions with observability and security data. That’s because of the heterogeneity of the data their environments generate and the limitations of the systems they rely on to analyze this information.

Most organizations are battling cloud complexity. Research has found that 99% of organizations have embraced a multicloud architecture. On top of these cloud platforms, they’re using an array of observability and security tools to deliver insight and control — seven on average. This results in siloed data that is stored in different formats, adding further complexity.

This challenge is exacerbated by the high cardinality of data generated by cloud native, Kubernetes-based apps. The sheer number of permutations can break traditional databases.

Many teams look to huge cloud-based data lakes, repositories that store data in its natural or raw format, to help centralize disparate data. A data lake enables teams to keep as much raw, “dumb” data as they wish, at relatively low cost, until teams in the business find a use for it.

When it comes to extracting insight, however, data needs to be transferred to a warehouse technology so it can be aggregated and prepared before it is analyzed. Various teams usually then end up transferring the data again to another warehouse platform, so they can run queries related to their specific business requirements.

When Data Storage Strategies Become Problematic

Data warehouse-based approaches add cost and time to analytics projects.

As many as tens of thousands of tables may need to be manually defined to prepare data for querying. There’s also the multitude of indexes and schemas needed to retrieve and structure the data and define the queries that will be asked of it. That’s a lot of effort.

Any user who wants to ask a new question for the first time will need to start from scratch to redefine all those tables and build new indexes and schemas, which creates a lot of manual effort. This can add hours or days to the process of querying data, meaning insights are at risk of being stale or are of limited value by the time they’re surfaced.

The more cloud platforms, data warehouses and data lakes an organization maintains to support cloud operations and analytics, the more money they will need to spend. In fact, the storage space required for the indexes used to support data retrieval and analysis may end up costing more than the data storage itself.

Further costs will arise if teams need technologies to track where their data is and to monitor data handling for compliance purposes. Frequently moving data from place to place may also create inconsistencies and formatting issues, which could affect the value and accuracy of any resulting analysis.

Combining Data Lakes and Data Warehouses

A data lakehouse approach combines the capabilities of a warehouse and a lake to solve the challenges associated with each architecture, thanks to its enormous scalability and massively parallel processing capabilities. With a data lakehouse approach to data retention, organizations can cope with high-cardinality data in a time- and cost-effective manner, maintaining full granularity and extra-long data retention to support instant, precise and contextual predictive analytics.

But to realize this vision, a data lakehouse must be schemaless, indexless and lossless. Being schema-free means users don’t need to predetermine the questions they want to ask of data, so new queries can be raised instantly as the business need arises.

Indexless means teams have rapid access to data without the storage cost and resources needed to maintain massive indexes. And lossless means technical and business teams can query the data with its full context in place, such as interdependencies between cloud-based entities, to surface more precise answers to questions.

Unifying Observability Data

Let’s consider the key types of observability data that any lakehouse must be capable of ingesting to support the analytics needs of a modern digital business.

  • Logs are the highest volume and often most detailed data that organizations capture for analytics projects or querying. Logs provide vital insights to verify new code deployments for quality and security, identify the root causes of performance issues in infrastructure and applications, investigate malicious activity such as a cyberattack and support various ways of optimizing digital services.
  • Metrics are the quantitative measurements of application performance or user experience that are calculated or aggregated over time to feed into observability-driven analytics. The challenge is that aggregating metrics in traditional data warehouse environments can create a loss of fidelity and make it more difficult for analysts to understand the relevance of data. There’s also a potential scalability challenge with metrics in the context of microservices architectures. As digital services environments become increasingly distributed and are broken into smaller pieces, the sheer scale and volume of the relationships among data from different sources is too much for traditional metrics databases to capture. Only a data lakehouse can handle such high-cardinality data without losing fidelity.
  • Traces are the data source that reveals the end-to-end path a transaction takes across applications, services and infrastructure. With access to the traces across all services in their hybrid and multicloud technology stack, developers can better understand the dependencies they contain and more effectively debug applications in production. Cloud native architectures built on Kubernetes, however, greatly increase the length of traces and the number of spans they contain, as there are more hops and additional tiers such as service meshes to consider. A data lakehouse can be architected such that teams can better track these lengthy, distributed traces without losing data fidelity or context.
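Purely as an illustration of that “full context, no lost fidelity” idea, here is a small Python sketch of a unified record that keeps a log line, its metric dimensions and its trace context together rather than aggregating them apart. The field names are hypothetical; the article does not prescribe a specific schema.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class ObservabilityRecord:
    """One high-cardinality record keeping log, metric and trace context together."""
    timestamp: float
    source: str                       # e.g., service or pod name
    log_line: Optional[str] = None    # raw log payload, unaggregated
    metric_name: Optional[str] = None
    metric_value: Optional[float] = None
    trace_id: Optional[str] = None    # ties the record to an end-to-end trace
    span_id: Optional[str] = None
    attributes: Dict[str, str] = field(default_factory=dict)  # arbitrary dimensions

# Example: a single checkout error carrying its full context.
record = ObservabilityRecord(
    timestamp=1684340453.21,
    source="checkout-service",
    log_line="payment provider timeout after 3 retries",
    metric_name="http.server.errors",
    metric_value=1.0,
    trace_id="4bf92f3577b34da6a3ce929d0e0e4736",
    span_id="00f067aa0ba902b7",
    attributes={"k8s.namespace": "prod", "cloud.region": "eu-west-1"},
)
```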

There are many other sources of data beyond metrics, logs, and traces that can provide additional insight and context to make analytics more precise. For example, organizations can derive dependencies and application topology from logs and traces.

If DevOps teams can build a real-time topology map of their digital services environment and feed this data into a lakehouse alongside metrics, logs and traces, it can provide critical context about the dynamic relationships between application components across all tiers. This provides centralized situational awareness that enables DevOps teams to raise queries about the way their multicloud environments work so they can understand how to optimize them more effectively.

User session data can also be used to gain a better understanding of how customers interact with application interfaces so teams can identify where optimization could help.

As digital services environments become more complex and data volumes explode, observability is certainly becoming more challenging. However, it’s also never been more critical. With a data lakehouse-based approach, DevOps teams can finally turn petabytes of high-fidelity data into actionable intelligence without breaking the bank or becoming burnt out in the effort.

The post Boost DevOps Maturity with a Data Lakehouse appeared first on The New Stack.

Vercel Offers Postgres, Redis Options for Frontend Developers https://thenewstack.io/vercel-offers-postgres-redis-options-for-frontend-developers/ Mon, 01 May 2023 16:00:40 +0000 https://thenewstack.io/?p=22706763


Increasingly, cloud provider Vercel is positioning itself as a one-stop for frontend developers. A slew of announcements this week makes that direction clear by adding to the platform a suite of serverless storage options, as well as new security and editing features.

“Basically, for the longest time, frontend developers have struggled to come to define how you put together these best-in-class tools into a single platform,” Lee Robinson, Vercel’s vice president of developer experience, told The New Stack. “The idea here really is what would storage look like if it was reimagined from the perspective of a frontend developer.”

All of the announcements will be explored in a free upcoming online conference of sorts later this week.

Rethinking Storage for Frontend Developers

Vercel wanted to think about storage that works with new compute primitives, such as serverless and edge — functions that mean frontend developers don’t have to think through some of the more traditional ways of connecting to a database, Robinson said.

Developers are moving away from monolithic database architectures and embracing distributed databases “that can scale and perform in the cloud,” the company said in its announcement. Vercel also wants to differentiate by integrating storage with JavaScript frameworks, such as Next.js, SvelteKit or Nuxt, Robinson said.

The new options came out of conversations in which developers said they wanted first-party integration with storage and a unified way to handle billing and usage, a single account to manage both their compute as well as their storage, all integrated into their frontend framework and their frontend cloud, Robinson added.

“Historically, frontend developers — trying to retrofit databases that were designed for a different era — have struggled to integrate those in modern frontend frameworks,” Robinson said. “They have to think about manually setting up connection pooling as their application scales in size and usage. They have to think about dialing the knobs for how much CPU or storage space they’re allotting for their database. And for a lot of these developers, they just want a solution that more or less works out of the box and scales with them as their site grows.”

The three storage products Vercel announced this week are:
1. Vercel Postgres, through a partnership with Neon.

“Postgres is an incredible technology. Developers love it,” Robinson said. “We wanted to build on a SQL platform that was reimagined for serverless and that could pair well with Vercel as a platform, and that’s why we chose to have the first-party integration with Neon, a serverless Postgres platform.”

The integration will give developers access to a fully managed, highly scalable, truly serverless fault-tolerant database, which will offer high performance and low latency for web applications, the company added. Vercel Postgres is designed to work seamlessly with the Next.js App Router and Server Components, which allow web apps to fetch data from the database to render dynamic content on the server, Vercel added.
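Since the article describes Vercel Postgres as built on Neon’s serverless Postgres, any standard Postgres client can illustrate the basic idea, even though Vercel’s first-party helpers target JavaScript frameworks. The sketch below assumes the connection string is exposed to the application as a POSTGRES_URL environment variable and uses a hypothetical page_views table; it is not the Vercel SDK, just a plain-Postgres illustration in Python.

```python
import os
import psycopg2

# Assumption: the platform exposes a standard Postgres connection URL
# to the project as an environment variable.
conn = psycopg2.connect(os.environ["POSTGRES_URL"])

with conn, conn.cursor() as cur:  # commits on success
    cur.execute(
        "INSERT INTO page_views (path, viewed_at) VALUES (%s, now())",
        ("/pricing",),
    )
    cur.execute("SELECT path, count(*) FROM page_views GROUP BY path")
    print(cur.fetchall())

conn.close()
```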

2. Vercel KV, a scalable, durable Redis-compatible database.

Redis is used as a key-value store in frontend development. Like Postgres, Redis is one of the top-rated databases and caches for developers, he said. Developers love its flexibility, its API and the fact that it’s open source.

“These databases can be used for rate limiting, session management and application state,” Vercel stated in its press release. “With Vercel KV, frontend developers don’t need to manage scaling, instance sizes or Redis clusters — it’s truly serverless.”

Vercel’s lightweight SDK works from edge or serverless functions and scales with a brand’s traffic.

“The interesting thing here — and what I’m really excited about with this one — is that traditionally, a lot of Redis instances would be ephemeral. So you would use them as a cache, you would store some data in them, and that cache would expire,” Robinson said. “The cool thing about durable storage, or our durable Vercel KV for Redis, is that you can actually use it like a database. You can store data in there and it will persist. So developers get the power and the flexibility that they love from Redis.”
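Because the article describes Vercel KV as Redis-compatible, a generic Redis client is enough to sketch the durable session-state and rate-limiting patterns mentioned above; Vercel’s own SDK is aimed at JavaScript projects. The connection URL, token and keys below are hypothetical placeholders.

```python
import json
import redis

# Hypothetical: connect with the Redis-protocol URL the platform provides.
kv = redis.Redis.from_url("rediss://default:<token>@example-kv.kv.vercel-storage.com:6379")

# Durable application state rather than a throwaway cache.
kv.set("session:abc123", json.dumps({"user_id": 42, "cart": ["SKU-1", "SKU-2"]}))

# Simple rate limiting: count requests per user per minute.
key = "ratelimit:user:42:2023-05-01T16:00"
kv.incr(key)
kv.expire(key, 60)

print(json.loads(kv.get("session:abc123")))
```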

3. Vercel Blob, a secure object storage service, which has been one of the top requests from the Vercel community. Vercel Blob offers file storage in the cloud using an API built on Web standard APIs, allowing users to upload files or attachments of any size. It will enable companies to host moderately complex apps entirely on Vercel without the need for a separate backend or database provider.

“Vercel Blob is effectively a fast and simple way to upload files,” Robinson said. “We’re working in partnership with Cloudflare and using their R2 product that allows you to effectively very easily upload and store files in the cloud, and have a really simple API that you can use; again, that works well with your frontend frameworks to make it easy to store images or any other type of file.”

Each offers developers an easy way to solve different types of storage problems, he said.

“If you step back and you look at the breadth of the storage products that we’re having these first-party integrations for, we’re trying to give developers a convenient, easy way to solve all of these different types of storage solutions,” Robinson said.

New Security Offerings from Vercel

Along with Vercel’s new storage products, the frontend cloud provider has also launched Vercel Secure Compute, which gives businesses the ability to create private connections between serverless functions and protect their backend cloud. Previously, companies had to allow all IP addresses on their backend cloud for a deployment to be able to connect with it, Vercel explained. With Vercel Secure Compute, the deployments and build container will be placed in a private network with a dedicated IP address in the region of the user’s choice and logically separated from other containers, the press release stated.

“Historically on the Vercel platform, you’ve had your compute, which is serverless functions or edge functions, and when we talk to our largest customers, our enterprise customers, they love the flexibility that offers, but they wanted to take it a step further and add additional security controls on top,” Robinson said. “To do that, we’ve offered a product called Vercel Secure Compute, which allows you to really isolate that compute and put it inside of the same VPC [virtual private cloud] as the rest of your infrastructure.”

It’s targeting large teams who have specific security rules or compliance rules and want additional control over their infrastructure, he added. Along with that, they introduced Vercel Firewall, with plans to introduce a VPN at some point in the future.

“The same customers when they’re saying, ‘I want more control, more granularity over my compute,’ they also want more control over the Vercel Edge network, and how they can allow or block traffic. So with Vercel firewall we’re giving our enterprise customers more flexibility for allowing or blocking specific IP addresses,” Robinson said.

Visual Editing Pairs with Comments on Preview

The company also released Vercel Visual Editing, which dovetails on the company’s release in December of Comments on Preview Deployments. Visual Editing means developers can work with non-technical colleagues and across departments to live-edit site content. To do that, Vercel partnered with Sanity, a real-time collaboration platform for structured content, to introduce a new open standard for content source mapping for headless CMS [content management systems]. The new standard works with any framework and headless CMS, the company added.

Vercel used it for the blog posts it’s creating about the new announcements, collectively nicknamed Vercel Ship, allowing the team to edit the content.

“The way that visual editing pairs into this, it actually works in harmony with Comments,” he said. “So for example, all of the blog posts that we’re working on for this upcoming Vercel Ship week, we’re using a combination of comments, as well as visual editing to allow our teams to give feedback say, ‘Let’s change this word here to a different word. Let’s fix this typo.’ Then the author or the editors can go and click the edit button go in make those changes directly and address the comment.”

The post Vercel Offers Postgres, Redis Options for Frontend Developers appeared first on The New Stack.

Why We Need an Inter-Cloud Data Standard https://thenewstack.io/why-we-need-an-inter-cloud-data-standard/ Thu, 27 Apr 2023 15:19:18 +0000 https://thenewstack.io/?p=22706411


The cloud has completely changed the world of software. Everyone from startup to enterprise has access to vast amounts of compute hardware whenever they want it. Need to test the latest version of that innovative feature you were working on? Go ahead. Spin up a virtual machine and set it free. Does your company need to deploy a disaster recovery cluster? Sure, if you can foot the bill.

No longer are we limited by having to procure expensive physical servers and maintain on premises data centers. Instead, we are free to experiment and innovate. Each cloud service provider has a litany of tools and systems to help accelerate your modernization efforts. The customer is spoiled with choices.

Except that they aren’t. Sure, a new customer can weigh the fees and compare the offerings. But that might be their last opportunity, because once you check in, you can’t check out.

The Hotel California Problem

Data is the lifeblood of every app and product, as all software, at its core, is simply manipulating data to create an experience for the user. And thankfully, cloud providers will happily take your precious data and keep it safe and secure for you. The cost to upload your data is minimal. However, when you want to take your ever-growing database and move it somewhere else, you will be in for a surprise: the worst toll road in history. Heavy bills and slow speeds.

Is there a technical reason? Let’s break it down. Just like you and me, the big cloud providers pay for their internet infrastructure and networking, and that must be factored into their business model. Since those costs can’t simply be absorbed, it makes business sense to subsidize data imports and tax data exports.

Additionally, the bandwidth of their internet networks is also limited. It makes sense to deprioritize the large data exports so that production application workloads are not affected.

Combine these factors and you can see why Amazon Web Services (AWS) offers a service where it sends a shipping container full of servers and data drives so you can migrate data to its services. It is often cheaper and faster to mail a hard drive than it is to download its contents from the web.

It helps that all these factors align with the interests of the company. When a large data export is detected, it probably is a strong indicator that the customer wants to lower their fees or take their business elsewhere. It is not to the cloud provider’s benefit to make it easy for customers to move their data out.

Except that it is in the cloud provider’s interest. It’s in everyone’s interest.

It Really Matters

The situation is not unlike the recent revolution in the smart home industry. Since its inception, it has been a niche enthusiast hobby. But in 2023, it is poised to explode.

Amazon, Google and Apple have ruthlessly pursued this market for years, releasing products designed to coordinate your home. They have tried to sell the vision of a world where your doors are unlocked by Alexa, where Siri watches your cameras for intruders and where Google sets your air conditioning to the perfect temperature. But you were only allowed one assistant. Alexa, Siri or Google.

By design, there was no cross compatibility; you had to go all in. This meant that companies who wanted to develop smart home products also had to choose an ecosystem, as developing a product that works with and is certified for all three platforms was prohibitively expensive. Product boxes had a ridiculous number of logos on them indicating which systems they work with and what protocol they operate on.


It was a minefield for consumers. The complexity of finding products that work with your system was unbearably high and required serious planning and research. It was likely you would walk out of the shop with a product that wouldn’t integrate with your smart home assistant.


This will change. In late 2022, products certified on the new industry standard, named Matter, started hitting shelves, and they work with all three ecosystems. No questions asked, and only one logo to look for. This reduces consumer complexity. It makes developing products easier, and it means that the smart home market can grow beyond a niche hobby for technology enthusiasts and into the mass market. By 2022, only 14% of households had experimented with smart technology. However, in the next four years, adoption of smart home technology is set to double, with another $100 billion of revenue being generated by the industry.

Bringing It Back to Cloud

We must look at it from the platform vendor’s perspective. Before Matter, users had to choose, and if they chose your ecosystem, it was a big win! Yet the story isn’t that simple, as the customers were left unfulfilled, limited to a small selection of products that they could use. Worse, the friction that this caused limited the size of the market and ensured that even if the vendor improved its offering, it was unlikely to cause a competitor’s customers to switch.

In this case, lock-in was so incredibly detrimental to the platform owners that all the players in the space acknowledged the existential threats to the budding market, driving traditionally bitter rivals to rethink, reorganize and build a new, open ecosystem.

The cloud service providers (CSPs) are in a similar position. The value proposition of the cloud was originally abundantly clear, and adoption exploded. Today, sentiment is shifting, and the cloud backlash has begun. After 10 years of growing cloud adoption, organizations are seeing their massive cloud bills continue to grow, with an expected $100 billion increase in spending in 2023 alone, while cloud lock-in limits agility.

With so much friction in moving cloud data around, it might be easier for customers to never move data there and just manage the infrastructure themselves.

The value still exists for sporadic workloads, or development and innovation, as purchasing and procurement is a pain for these sorts of use cases. Yet, even these bleeding-edge use cases can be debilitated by lock-in. Consider that there may be technology offered by AWS and another on Google Cloud that together could solve a key technical challenge that a development team faces. This would be a win-win-win. Both CSPs would gain valuable revenue, and the customer would be able to build their technology. Unfortunately, today this is impossible as the data transfer costs make this unreasonably expensive.

There are second-order effects as well. Each CSP currently must invest in hundreds of varying services for their customers. As for each technology category, the cloud provider must offer a solution to its locked-in customers. This spreads development thin, perhaps limiting the excellence of each individual service since many of them need to be developed and supported. As thousands of employees are let go by Amazon (27,000), Google (12,000) and Microsoft (10,000), can these companies really keep up the pace? Wouldn’t quality and innovation go up if these companies could focus their efforts on their differentiators and best-in-class solutions? Customers could shop at multiple providers and always get the best tools for their money.

High availability is another key victim of the current system. Copies of the data must be stored and replicated in a set of discrete locations to avoid data loss. Yet data transfer costs mean that replicating data between availability zones within a single cloud region already drives up the bill. Forget replicating any serious amount of data between cloud providers, as that becomes infeasible due to cost and latency. This places real limits on how well customers can protect their data from disasters or failures, artificially capping risk mitigations.

An Industry Standard

So many of today’s cloud woes come down to the data-transfer cost paradigm. The industry needs a rethink. Just like the smart home companies came together to build a single protocol called Matter, perhaps the CSPs could build a simple, transparent and unified system for data transfer fees.

The CSPs could invest in building an inter-cloud super highway: an industry-owned and -operated infrastructure designed solely for moving data between CSPs with the required throughput. Costs would go down as the public internet would no longer be a factor.

A schema could be developed to ensure interoperability between technologies and reduce friction for users looking to migrate their data and applications. An encryption standard could be enforced to ensure security and compliance and use of the aforementioned cross-cloud network would reduce risk of interception by malicious actors. For critical multicloud applications, customers could pay a premium to access faster inter-cloud rates.

Cloud providers would be able to further differentiate their best product offerings knowing that if they build it, the customers will come, no longer locked into their legacy cloud provider.

Customers could avoid lengthy due diligence when moving to the cloud, as they could simply search for the best service for their requirements, no longer buying the store when they just need one product. Customers would benefit from transparent and possibly reduced costs with the ability to move their business when and if they want to. Overall agility would increase, allowing strategic migration on and off the cloud as requirements change.

And of course, a new level of data resilience could be unlocked as data could be realistically replicated back and forth between different cloud providers, ensuring the integrity of the world’s most important data.

This is a future where everyone wins. The industry players could ensure the survival and growth of their offerings in the face of cloud skepticism. Customers get access to the multitudes of benefits listed above. Yes, it would require historic humility and cooperation from some of the largest companies in the world, but together they could usher in a new generation of technology innovation.

We need an inter-cloud data mobility standard.

In the Meantime

Today there is no standard, and all the opposite is true. The risks of cloud lock-in are high, and customers must mitigate them by leveraging the cloud carefully and intelligently. Data transfer fees cannot be avoided, but there are other ways to lower your exposure.

That’s why Couchbase enables its NoSQL cloud database to be used in a multitude of different ways. You can manage it yourself, or use the Autonomous Operator to deploy it on any Kubernetes infrastructure (on premises or in the cloud). We also built our database-as-a-service, Capella, to natively run on all three major cloud platforms.

Couchbase natively leverages multiple availability zones and its internal replication technology to ensure high availability. With Cross Datacenter Replication (XDCR), you can asynchronously replicate your data across regions and even cloud platforms themselves to ensure your data is safe even in the worst-case scenarios.

Try Couchbase Capella today with a free trial and no commitment.

The post Why We Need an Inter-Cloud Data Standard appeared first on The New Stack.

We Designed Our Chips with First Pass Success — and So Can You https://thenewstack.io/we-designed-our-chips-with-first-pass-success-and-so-can-you/ Wed, 26 Apr 2023 17:00:12 +0000 https://thenewstack.io/?p=22704549


Like most thrilling adventures, this one began with a question: When interviewing for my current job at ScaleFlux, a computational storage vendor, in early 2019, my future boss asked me, “Do you think we can design our own chips?” For a small startup like ScaleFlux, it might seem like an insurmountable project. But, appreciating a challenge, I responded enthusiastically, “Yes!”

What are the benefits? You gain flexibility and increase sustainability. You level the playing field so you can compete with major global technology companies. You also retain control of your architecture rather than giving it to someone else. After all, your architecture is your company’s secret sauce.

Another advantage is time to market. If you don’t have control over your chip design, a change in your assembled final product could result in a six-month delay in production — or longer — because of the redesign and validation steps involved.

Most of the strategies involved in chip design are the same regardless of what type of chip you build. However, at ScaleFlux, we were specifically interested in designing a system on a chip (SoC) for next-generation SSD drives because it can combine many aspects of a computer system into a single chip.

My goal for our chip design was “A-0” success, meaning immediate production readiness, with no modification or redesign necessary. Producing a market-ready A-0 chip is a monumental challenge barely attempted by even the largest, most sophisticated tech companies. Still, it is achievable with the right mentality and a “can do” attitude.

As we discovered later, stalled capacity at fabs (the fabrication plants manufacturing semiconductors) provided further incentive to be in control of our own destiny. The global economy struggled with the chip shortage, which resulted in shipping challenges caused by shortages of the materials used to package the chips.

Yet, because we had reserved our position in the fabs, we could weather the worst of the chip shortage as competitors floundered.

We designed a working chip on our first try, and I am proud of our success. Let me tell you that now is a fantastic time to get started with designing your chips, and I want to see others accomplish the same goal. While a smart — and likely profitable — venture, designing your own chips isn’t easy. It requires a substantial initial investment of time and money. I experienced that for myself when ScaleFlux took on the challenge. There were bumps along the way, but looking back, it’s all been worth it — and the ROI took just a couple of years.

So, allow me to share six strategies I found tremendously helpful in streamlining the chip design process:

1. Commit Enough Resources to the Project

The initial investment includes building your team, purchasing simulation and design technology tools and contracting with production partners to develop your chips. While the schedule is king, quality is queen, so build enough time into your schedule to keep the quality high.

ScaleFlux’s investment was tens of millions of dollars, and the project took a year and a half. Whatever your level of investment, allocate a sufficient amount for success and never cut corners. After all, there’s no patch for most hardware failures.

2. Gather a Skilled Team

Every person on our 50-person team contributed to the project’s success. I didn’t need a lone hero who could “do it all.” I needed a cross-functional group of humble, hard-working people who could admit what they didn’t know, communicate effectively and focus on the company’s objectives over their egos. I’ll never forget the big smiles on team members’ faces at the party to celebrate the successful validation of our first chips.

3. Pick the Right Partners

Select technology partners with a track record of success in specialized domains, such as physical design, testability and packaging. Qualified semiconductor engineers are a hot commodity in the US, so be flexible about where you source your talent. Your next team member could be an up-and-coming intern still in college or a semiconductor engineer who lives on the other side of the globe. For the verification team, aim for a minimum 1:2 ratio of designers to verification engineers. And as your team achieves, reward the team rather than individual members.

4. Adopt a Leadership Mindset

Agility is a significant advantage at a startup — you shouldn’t waste time making decisions. Stepping outside your comfort zone as a company and designing new technology is risky. But with the right mindset and appropriate resources, you can build a foundation of discipline and thorough planning to take calculated risks that bring tremendous gains.

While we limited architectural changes after the design kickoff, chip design involves a million small tasks. If just one of them goes wrong, like a team member misreading a design spec and making an error, your chip won’t work. We automated process steps whenever possible and used numerous checklists, following the adage to trust but verify; those checklists kept us on task during countless reviews at every significant design phase.

5. Seek Outside Expertise When Needed

Large companies sometimes want to do everything in-house and resist partnering with outside resources. When they do, they are often limited to working with a short list of approved partners. Startups and smaller companies are more comfortable reaching out for expertise because of headcount limitations and have more flexibility to work with promising startups.

Our partnerships on our chip design initiative let us tap into some of the industry’s top chip design and production experts and benefit from their knowledge and connections. One supplier had paid years in advance for production line capacity, which proved valuable. Another was a relatively unknown, small company started by a talented lead engineer I’d worked with for over 15 years at a previous company.

6. Resist Being Dazzled by “Cool Technology.”

Instead of prioritizing cost, I prioritized quality and production readiness in components. Some employees and board members urged me to use new technology in our chip design. I resisted. My first rule for my team was to limit ourselves to only selecting established, market-tested intellectual property (IP) when sourcing technology for our chips.

All the IP we used when designing our chips was production-worthy rather than new. I chose older, time-tested technology over shiny, new technology. I wanted IP that had pushed dozens of previous SoCs to production and contributed to millions of shipped chips.

You, Too, Can Design Your Own Chips

With supply chain issues continuing and China dominating the chip fabrication market, companies can save money and gain a competitive edge by handling the first half of chip manufacturing — chip design — themselves.

Now, ScaleFlux is working on technology focused on reducing costs and designing next-generation chips for our future products. I hope our story inspires you. We’re a technology startup without the resources of an established company. We saw a need — and recognized the significant benefits we could realize — and we made it happen.

So can you.

The post We Designed Our Chips with First Pass Success — and So Can You appeared first on The New Stack.

ACID Transactions Change the Game for Cassandra Developers https://thenewstack.io/acid-transactions-change-the-game-for-cassandra-developers/ Wed, 26 Apr 2023 15:33:43 +0000 https://thenewstack.io/?p=22706385


For years, Apache Cassandra has been solving big data challenges such as horizontal scaling and geolocation for some of the most demanding use cases. But one area, distributed transactions, has proven particularly challenging for a variety of reasons.

It’s an issue that the Cassandra community has been hard at work to solve, and the solution is finally here. With the release of Apache Cassandra version 5.0, which is expected later in 2023, Cassandra will offer ACID transactions.

ACID transactions will be a big help for developers, who have been calling for more SQL-like functionality in Cassandra. This means that developers can avoid a bunch of complex code that they used for applying changes to multiple rows in the past. And some applications that currently use multiple databases to handle ACID transactions can now rely solely on Cassandra to solve their transaction needs.

What Are ACID Transactions and Why Would You Want Them?

ACID transactions adhere to the following characteristics:

  • Atomicity — Operations in the transaction are treated as a single unit of work and can be rolled back if necessary.
  • Consistency — Different from the “consistency” that we’re familiar with from the CAP Theorem, this is about upholding the state integrity of all data affected by the transaction.
  • Isolation — Assuring that the data affected by the transaction cannot be interfered with by competing operations or transactions.
  • Durability — The data will persist at the completion of the transaction, even in the event of a hardware failure.

While some NoSQL databases have managed to implement ACID transactions, they traditionally have only been a part of relational database management systems (RDBMS). One reason for that: RDBMSs historically have been contained within a single machine instance. The reality of managing database operations is that it’s much easier to provide ACID properties when everything is happening within the bounds of one system. This is why the inclusion of full ACID transactions into a distributed database such as Cassandra is such a big deal.

The advantage of ACID transactions is that multiple operations can be grouped together and essentially treated as a single operation. For instance, if you’re updating several points of data that depend on a specific event or action, you don’t want to risk some of those points being updated while others aren’t. ACID transactions enable you to do that.

Example Transaction

Let’s look at a game transaction as an example. Perhaps we’re playing one of our favorite board games about buying properties. One of the players, named “Avery,” lands on a property named “Augustine Drive” and wants to buy it from the bank for $350.

There are three separate operations needed to complete the transaction:

  • Deduct $350 from Avery
  • Add $350 to the bank
  • Hand ownership of Augustine Drive to Avery

ACID transactions will help to ensure that:

  • Avery’s $350 doesn’t disappear
  • The bank doesn’t just receive $350 out of thin air
  • Avery doesn’t get Augustine Drive for free

Essentially, an ACID transaction helps to ensure that all parts of this transaction are either applied in a consistent manner or rolled back.
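For context on what developers can do today, here is a minimal sketch using the Python driver and a CQL logged batch. A logged batch provides atomicity (all three writes eventually apply, or none do), but not the isolation or multi-partition conditional semantics that the Accord-based transactions described below will add. The keyspace, tables and absolute balance values are hypothetical.

```python
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("board_game")  # hypothetical keyspace

# Today: a logged batch groups the three writes atomically (all or nothing,
# eventually), but offers no isolation and no cross-row condition checks.
session.execute(
    """
    BEGIN BATCH
        UPDATE players    SET balance = 650    WHERE name = 'Avery';
        UPDATE bank       SET balance = 10350  WHERE id = 1;
        UPDATE properties SET owner = 'Avery'  WHERE name = 'Augustine Drive';
    APPLY BATCH
    """
)

cluster.shutdown()
```

Note that the new balances have to be computed by the application beforehand, which is exactly the kind of read-then-write gap that true ACID transactions close.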

Consensus with Accord

Cassandra will be able to support ACID transactions thanks to the Accord protocol. As part of the Cassandra Enhancement Proposal (CEP) process, CEP-15 introduces general-purpose transactions based on the Accord white paper. The main points of CEP-15 are:

  • Implementation of the Accord consensus protocol
  • Strict, serializable isolation
  • The best attempts will be made to complete the transaction in one round trip
  • Operation over multiple partition keys

With the Accord consensus protocol, each node in a Cassandra cluster has a structure called a “reorder buffer.” This buffer is designed to hold transaction timestamps for the future.


Figure 1: A coordinator node presenting a future timestamp to its voting replicas.

Essentially, a coordinator node takes a transaction and proposes a future timestamp for it. It then presents this timestamp (Figure 1) to the “electorate” (voting replicas for the transaction). The replicas then check to see if they have any conflicting operations.

As long as a quorum of the voting replicas accepts the proposed timestamp (Figure 2), the coordinator applies the transaction at that time. This process is known as the “Fast Path,” because it can be done in a single round trip.

Figure 2: All the voting replicas “accept” the proposed timestamp, and the “Fast Path” application of the transaction can proceed.

However, if a quorum of voting replicas fails to “accept” the proposed timestamp, the conflicting operations are reported back to the coordinator along with a newly proposed timestamp for the original transaction.
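The following is a deliberately simplified Python sketch of that voting round, only to illustrate the fast-path logic described above. It is not the actual Accord implementation; the replica object and its methods are hypothetical stand-ins, and recovery, dependency tracking and the slow path are glossed over entirely.

```python
import time

def propose_timestamp(replicas, txn):
    """Simplified fast-path round: propose a future timestamp, collect votes."""
    proposed_ts = time.time_ns() + 10_000_000  # a timestamp slightly in the future
    accepts, conflicts = 0, []

    for replica in replicas:
        # Each replica checks its reorder buffer for conflicting operations
        # that would have to execute before the proposed timestamp.
        conflict = replica.conflicting_ops_before(proposed_ts, txn)
        if conflict:
            conflicts.append(conflict)
        else:
            replica.reserve(proposed_ts, txn)
            accepts += 1

    if accepts >= len(replicas) // 2 + 1:
        return proposed_ts  # fast path: apply txn at this timestamp
    # Otherwise the coordinator retries with a new, later timestamp
    # that orders the transaction after the reported conflicts.
    return None
```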

Wrapping Up

The addition of ACID transactions to a distributed database like Cassandra is an exciting change, in part because it opens Cassandra up to several new use cases:

  • Automated payments
  • Game transactions
  • Banking transfers
  • Inventory management
  • Authorization policy enforcement

Previously, Cassandra would have been unsuited for the cases listed above. Many times, developers have had to say, “We want to use Cassandra for X, but we need ACID.” No more!

More importantly, this is the beginning of Cassandra evolving into a feature-rich database. And that is going to improve the developer experience by leaps and bounds and help make Cassandra a first-choice datastore for all developers building mission-critical applications. If you want to stay up to date on the latest news for Cassandra developers, check out Planet Cassandra. While you’re there, you can see an ever-growing list of real-world use cases if you need some ideas. And if you’re a Cassandra user, we’d love to publish your story here.

The post ACID Transactions Change the Game for Cassandra Developers appeared first on The New Stack.

Inside Tencent Games’ Real-Time Event-Driven Analytics System https://thenewstack.io/inside-tencent-games-real-time-event-driven-analytics-system/ Tue, 25 Apr 2023 18:00:35 +0000 https://thenewstack.io/?p=22705767


As part of Tencent Interactive Entertainment Group Global (IEG Global), Proxima Beta is committed to supporting our teams and studios to bring unique, exhilarating games to millions of players around the world. You might be familiar with some of our current games, such as PUBG Mobile, Arena of Valor and Tower of Fantasy.

Our team at Level Infinite, the brand for global publishing, is responsible for managing a wide range of risks to our business such as cheating and harmful content. From a technical perspective, this required us to build an efficient real-time analytics system to consistently monitor all kinds of activities in our business domain.

In this article, we will share our experience building this real-time, event-driven analytics system. First, we’ll explore why we built our service architecture based on command and query responsibility segregation (CQRS) and event-sourcing patterns with Apache Pulsar and ScyllaDB. Next, we’ll look at how we use ScyllaDB to solve the problem of dispatching events to numerous gameplay sessions. Finally, we’ll cover how we use ScyllaDB keyspaces and data replication to simplify our global data management.

A Peek at the Use Case: Addressing Risks in Tencent Games

Let’s start with a real-world example of what we’re working with and the challenges we face.

Take the in-game report dialog from Tower of Fantasy, a 3D action role-playing game. Players can use this dialog to file a report against another player for various reasons. If you were to use a typical CRUD system for it, how would you keep those records for follow-ups? And what are the potential problems?

The first challenge would be determining which team is going to own the database to store this form. There are different reasons to make a report (including an option called “Others”), so a case might be handled by different functional teams. However, there is not a single functional team in our organization that can fully own the form.

That’s why it is a natural choice for us to capture this case as an event, like “report a case.” All the information is captured in this event as is. All functional teams only need to subscribe to this event and do their own filtering. If they think the case falls into their domain, they can just capture it and trigger further actions.

CQRS and Event Sourcing

The service architecture behind this example is based on the CQRS and event-sourcing patterns. If these terms are new to you, don’t worry. By the end of this overview, you should have a solid understanding of these concepts. And if you want more detail at that point, take a look at our blog dedicated to this topic.

The first concept to understand here is event sourcing. The core idea behind event sourcing is that every change to a system’s state is captured in an event object, and these event objects are stored in the order in which they were applied to the system state. In other words, instead of just storing the current state, we use an append-only store to record the entire series of actions taken on that state. This concept is simple but powerful as the events that represent every action are recorded so that any possible model describing the system can be built from the events.
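As a toy illustration of the idea (not our production code), the sketch below appends every change as an event and rebuilds a read model purely by replaying the history. The event types and fields are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Event:
    session_id: str
    type: str
    payload: Dict

event_log: List[Event] = []  # append-only: we never overwrite state

def record(event: Event) -> None:
    event_log.append(event)

def reports_for(session_id: str) -> List[Dict]:
    """One possible read model, rebuilt entirely by replaying the history."""
    return [e.payload for e in event_log
            if e.session_id == session_id and e.type == "report_a_case"]

record(Event("session-1", "player_joined", {"player": "P1"}))
record(Event("session-1", "report_a_case", {"reporter": "P1", "reason": "cheating"}))
print(reports_for("session-1"))
```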

The next concept is CQRS, which stands for command query responsibility segregation. CQRS was coined by Greg Young over a decade ago and originated from the command and query separation principle. The fundamental idea is to create separate data models for reads and writes, rather than using the same model for both purposes. By following the CQRS pattern, every API should either be a command that performs an action or a query that returns data to the caller, but not both. This naturally divides the system into two parts: the write side and the read side.

This separation offers several benefits. For example, we can scale write and read capacity independently for optimizing cost efficiency. From a teamwork perspective, different teams can create different views of the same data with fewer conflicts.

The high-level workflow of the write side can be summarized as follows: Events that occur in numerous gameplay sessions are fed into a limited number of event processors. The implementation is also straightforward, typically involving a message bus such as Pulsar, Kafka or a simpler queue system that acts as an event store. Events from clients are persisted in the event store by topic, and event processors consume events by subscribing to topics. If you’re interested in why we chose Apache Pulsar over other systems, you can find more information in the blog referenced earlier.
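
As a rough sketch of that fan-in flow, assuming a local Pulsar broker and the Python pulsar-client package; the topic name, subscription name and payload are placeholders, not our actual configuration.

import pulsar

client = pulsar.Client("pulsar://localhost:6650")

# Write side: a game server publishes an activity event to a topic.
producer = client.create_producer("gameplay-events")
producer.send(b'{"session_id": "abc123", "kind": "report_filed"}')

# An event processor consumes the topic through a subscription.
consumer = client.subscribe("gameplay-events", subscription_name="risk-processors")
msg = consumer.receive()
print(msg.data())
consumer.acknowledge(msg)

client.close()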

Although queue-like systems are usually efficient at handling traffic that flows in one direction (such as fan-in), they may not be as effective at handling traffic that flows in the opposite direction (fan-out). In our scenario, the number of gameplay sessions will be large, and a typical queue system doesn’t fit well since we can’t afford to create a dedicated queue for every game-play session. We need to find a practical way to distribute findings and metrics to individual gameplay sessions through Query APIs. This is why we use ScyllaDB to build another queue-like event store, which is optimized for event fan-out. We will discuss this further in the next section.

Before we move on, here’s a summary of our service architecture.

Starting from the write side, game servers keep sending events to our system through command endpoints, and each event represents a certain kind of activity that occurred in a gameplay session. Event processors produce findings or metrics against the event streams of each gameplay session and act as a bridge between two sides. On the read side, we have game servers or other clients that keep polling metrics and findings through query endpoints and take further actions if abnormal activities have been observed.

Distributed Queue-Like Event Store for Time Series Events

Now let’s look at how we use ScyllaDB to solve the problem of dispatching events to numerous gameplay sessions. By the way, if you Google “Cassandra” and “queue,” you may come across an article from over a decade ago stating that using Cassandra as a queue is an anti-pattern. While this might have been true at that time, I would argue that it is only partially true today. We made it work with ScyllaDB, which is compatible with Cassandra.

To support the dispatch of events to each gameplay session, we use the session ID as the partition key so that each gameplay session has its own partition, and events belonging to a particular gameplay session can be located by the session ID efficiently.

Each event also has a unique event ID, a time-based universally unique identifier (timeuuid), as the clustering key. Because records within the same partition are sorted by the clustering key, the event ID can be used as the position ID in a queue. Finally, ScyllaDB clients can efficiently retrieve newly arrived events by tracking the event ID of the most recent event they have received.
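
A sketch of what such a table and query could look like using the Python cassandra-driver, which ScyllaDB supports. The keyspace, table and column names are hypothetical, the session and last-seen IDs stand in for values your application tracks, and the compaction setting is explained below.

import uuid
from cassandra.cluster import Cluster

session = Cluster(["scylla-node1"]).connect()

# One partition per gameplay session, ordered by a timeuuid clustering key.
# Assumes the keyspace already exists.
session.execute("""
    CREATE TABLE IF NOT EXISTS risk.session_events (
        session_id uuid,
        event_id   timeuuid,
        payload    text,
        PRIMARY KEY (session_id, event_id)
    ) WITH CLUSTERING ORDER BY (event_id ASC)
      AND compaction = {'class': 'TimeWindowCompactionStrategy'}
""")

session_uuid = uuid.uuid4()        # the gameplay session being polled
last_seen_event_id = uuid.uuid1()  # timeuuid of the last event this client processed

# Fetch only the events that arrived after the last one we have seen.
rows = session.execute(
    "SELECT event_id, payload FROM risk.session_events "
    "WHERE session_id = %s AND event_id > %s",
    (session_uuid, last_seen_event_id),
)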

There is one caveat to keep in mind when using this approach: the consistency problem. Retrieving new events by tracking the most recent event ID relies on the assumption that no event with a smaller ID will be committed in the future. However, this assumption might not always hold true. For example, if two nodes generate two event identifiers at the same time, an event with a smaller ID might be inserted later than an event with a larger ID.

This problem, which I refer to as a “phantom read,” is similar to the phenomenon in the SQL world where repeating the same query can yield different results because another transaction has committed changes in between. However, the root cause of the problem in our case is different. It occurs when events are committed to ScyllaDB out of the order indicated by the event ID.

There are several ways to address this issue. One solution is to maintain a cluster-wide status, which I call a “pseudo now,” based on the smallest value of the moving timestamps among all event processors. Each event processor should also ensure that all future events have an event ID greater than its current timestamp.
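
A minimal, purely illustrative sketch of how such a watermark might be tracked; the class and field names are invented. Each processor reports the timestamp below which it will never emit another event, and readers only consume up to the smallest reported value.

class PseudoNowTracker:
    """Tracks a cluster-wide 'safe to read up to' timestamp (the pseudo now)."""

    def __init__(self) -> None:
        self.processor_clocks: dict[str, int] = {}  # processor id -> promised lower bound (ms)

    def report(self, processor_id: str, committed_ts_ms: int) -> None:
        # The processor promises that every event it emits from now on will
        # carry a timestamp (and therefore a timeuuid) greater than this value.
        self.processor_clocks[processor_id] = committed_ts_ms

    def pseudo_now(self) -> int:
        # Readers may safely consume events up to the smallest promise;
        # nothing older than this can still arrive out of order.
        # Until at least one processor has reported, nothing is considered safe.
        return min(self.processor_clocks.values(), default=0)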

Another important consideration is enabling TimeWindowCompactionStrategy, which eliminates the negative performance impact caused by tombstones. Accumulation of tombstones was a major issue that prevented the use of Cassandra as a queue before TimeWindowCompactionStrategy became available.

Now let’s discuss other benefits beyond using ScyllaDB as a dispatching queue.

Simplifying Complex Global Data Distribution Challenges

Since we are building a multitenant system to serve customers around the world, it is essential to ensure that customer configurations are consistent across clusters in different regions. Trust us, keeping a distributed system consistent is not a trivial task if you plan to do it all by yourself.

We solved this problem by simply enabling data replication on a keyspace across all data centers. This means any change made in one data center will eventually propagate to others. Thank ScyllaDB, as well as DynamoDB and Cassandra, for the heavy lifting that makes this challenging problem seem trivial.

You might be thinking that using any typical relational database management system (RDBMS) could achieve the same result since most databases also support data replication. This is true if there is only one instance of the control panel running in a given region. In a typical primary/replica architecture, only the primary node supports read/write while replica nodes are read-only. However, when you need to run multiple instances of the control panel across different regions — for example, every tenant has a control panel running in its home region, or even every region has a control panel running for local teams — it becomes much more difficult to implement this using a typical primary/replica architecture.

If you have used AWS DynamoDB, you may be familiar with a feature called “global table” that allows applications to read and write locally and access the data globally. Enabling replication on keyspaces with ScyllaDB provides a similar feature, but without vendor lock-in. You can easily extend global tables across a multicloud environment.
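
A sketch of what enabling that replication looks like at the keyspace level; the keyspace name and data center names here are placeholders and must match the names your cluster actually uses.

from cassandra.cluster import Cluster

session = Cluster(["scylla-node1"]).connect()

# Rows written to this keyspace in any data center are replicated to the others.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS tenant_config
    WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'us-east': 3,
        'eu-west': 3,
        'ap-southeast': 3
    }
""")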

Keyspaces as Data Containers

Next, let’s look at how we use keyspaces as data containers to improve the transparency of global data distribution.

Consider a typical data distribution problem imposed by data protection laws. For example, suppose region A allows certain types of data to be processed outside of its borders as long as an original copy is kept in its region. As a product owner, how can you ensure that all your applications comply with this regulation?

One potential solution is to perform end-to-end (E2E) tests to ensure that applications send the correct data to the correct region as expected. This approach requires application developers to take full responsibility for implementing data distribution correctly. However, as the number of applications grows, it becomes impractical for each application to handle this problem individually, and E2E tests become increasingly expensive in terms of both time and money.

Let’s think twice about this problem. By enabling data replication on keyspaces, we can divide the responsibility for correctly distributing data into two tasks: 1) identifying data types and declaring their destinations, and 2) copying or moving data to the expected locations.

By separating these two duties, we can abstract away complex configurations and regulations from applications. This is because the process of transferring data to another region is often the most complicated part to deal with, such as passing through network boundaries, correctly encrypting traffic and handling interruptions.

After separating these two duties, applications are only required to correctly perform the first step, which is much easier to verify through testing at earlier stages of the development cycle. Additionally, the correctness of configurations for data distribution becomes much easier to verify and audit. You can simply check the settings of keyspaces to see where data is going.

Tips for Others Taking a Similar Path

To conclude, we’ll leave you with important lessons that we learned and that we recommend you apply if you take a similar path:

  • When using ScyllaDB to handle time series data, such as using it as an event-dispatching queue, remember to use the TimeWindowCompactionStrategy.
  • Consider using keyspaces as data containers to separate the responsibility of data distribution. This can make complex data distribution problems much easier to manage.

This article is based on a tech talk presented at ScyllaDB Summit 2023. You can watch this talk — as well as talks by engineers from Discord, Epic Games, Strava, ShareChat and more — on demand.

Senior software engineer Zhiwei Peng also contributed to this article and project.

The post Inside Tencent Games’ Real-Time Event-Driven Analytics System appeared first on The New Stack.

Get to Know Warewulf 4 https://thenewstack.io/get-to-know-warewulf-4/ Fri, 21 Apr 2023 17:00:53 +0000 https://thenewstack.io/?p=22704936

The Warewulf 4 cluster-provisioning system was recently added to the OpenHPC collection of open source tools for high-performance computing (HPC), alongside (and, eventually, supplanting) legacy Warewulf 3. If you’ve been waiting for downstream adoption before switching to Warewulf 4, now is a great time to get to know what’s so exciting about next-generation HPC provisioning.

Simpler and Easier to Maintain

Warewulf 3 was a complex beast, with a codebase written mostly in Perl that had evolved over many years. Warewulf 4 revisits many of the fundamental design decisions, making a final product that is easier to administer and maintain:

  • Warewulf 4 has been rewritten in Go, providing modern safety and usability features while enabling integration with a growing ecosystem of relevant libraries.
  • Warewulf 4’s configuration is all plain-text/YAML now, and all data (e.g., files for deployment to cluster nodes) are stored as files on the Warewulf server, giving more transparency to the cluster administrator.
  • Warewulf 4 is a stateless provisioning system throughout and has removed previous attempts to support stateful image deployment, simplifying the code base, feature set and interface. (That doesn’t mean your nodes have to be diskless, or even stateless, though!)
  • Warewulf 4 no longer assumes the same tight integration between the “head node” and its server that, in previous versions, required keeping them in tight sync for kernel versions and driver modules.

Translation Guide

Warewulf 4’s interface and, in many ways, its underlying architecture have been redesigned. So here’s a translation guide to get you started if you’re coming from Warewulf 3:

  • All of the previous Warewulf commands (including wwsh, wwbootstrap, wwmkchroot, wwmngchroot and wwvnfs) are replaced with a new command-line interface, wwctl. Unlike wwsh, wwctl does not operate as an interactive shell, but it does consolidate all of Warewulf 4’s administrative capabilities into a single command. (That single command does have multiple sub-commands, and each is documented with a man page and --help.)
  • Like Warewulf 3, Warewulf 4 builds its configuration on the concept of a node, which consolidates the configuration for each individual cluster node in a cluster. Warewulf 4 adds node “profiles,” which contain the same configuration parameters as a node but can then be applied to multiple nodes. Multiple profiles may also be applied to a single node, allowing a node’s final configuration to be built from remixable parts.
  • Like Warewulf 3, Warewulf 4 distributes pre-built node images via a network boot process. However, Warewulf 4 has renamed its images from vnfs (“virtual node file-system”) to containers, emphasizing Warewulf’s new integration with the OCI ecosystem. (More on that in a bit.)
  • Warewulf 3 supported importing files into its database for provisioning to cluster nodes beyond the node image itself. Warewulf 4 replaces the management of discrete files with its new “overlay” system, which groups related files for application either during system boot or periodically at runtime.

Import from OCI containers

Warewulf has always been, at its core, an automation system for building, maintaining, and serving node images to cluster nodes. The process for building these node images has historically been somewhat cumbersome and quite manual:

  1. Make a chroot directory
  2. Use dnf or yum to install packages in the chroot
  3. Copy files into the chroot environment
  4. Shell into the chroot environment and make configuration changes

While much of this process borrows from more general system administration procedures, even in the best case, you’re left with a thick directory full of manually built, unversioned, tenuously maintained state. This process must be repeated for each new image, at each site, with only documentation serving to aid reproducibility.

Warewulf 4 replaces the entire image initialization process with wwctl container import, which initializes new node images by importing them from upstream OCI container registries.

Want a cluster node based on Rocky Linux? It’s as simple as:

wwctl container import docker://docker.io/warewulf/rocky


Then you can shell into the image and make configuration changes as before, or you can extend your use of the OCI ecosystem by defining your customizations in your own Containerfile. This Containerfile is easily maintained, version-controlled and shared, allowing you to track changes to your node images over time and rebuild them reliably and trivially on-demand.

Manage Kernels Together with Node Images

There have also been improvements to kernel management. You no longer have to import kernels into Warewulf by hand: instead, Warewulf 4 automatically detects and uses the kernel from the node image itself.

This allows kernels to be managed within the image, using the same kernel packages that you would typically use when deploying the OS. This also has the side effect of simplifying the integration of out-of-tree drivers and other kernel extensions: modules built inside the container build against the kernel inside the container as usual, and that kernel is used by Warewulf during boot, with no manual synchronization required.

Customize Nodes with Overlays

As we mentioned before, Warewulf 4’s overlay system provides remixable, per-profile and per-node customization of provisioned cluster nodes at system boot time and periodically at runtime. But the overlay system goes beyond static files, supporting dynamic templates that are rendered with input data relevant to the target node or even the entire cluster.

With this functionality, it’s easy to write a single configuration file template that applies to all nodes in the cluster or even to write a configuration file template for a master node—say, a queueing system server—that draws on Warewulf metadata to automatically and dynamically define cluster-wide configuration.

The Future

Warewulf 4 is under active development and is growing quickly, adding significant features with each new release. You can download the latest release of Warewulf 4 from GitHub. We also hope you’ll join the community, either via our active Slack team or mailing list. And, of course, be sure to check out the Warewulf website, where you can find documentation and more, and the blog, where we’ll continue to post tips, tricks and information about new features in Warewulf as they are released.

The post Get to Know Warewulf 4 appeared first on The New Stack.

Distributed Database Architecture: What Is It? https://thenewstack.io/distributed-database-architecture-what-is-it/ Tue, 18 Apr 2023 13:20:04 +0000 https://thenewstack.io/?p=22705497

Databases power all modern applications. They’re behind your Angry Birds mobile game as much as they’re behind the space shuttle. In the beginning, databases were hosted on a single physical machine. Basically, it was a computer running only one program: the database. Then we moved to running databases on virtual machines, where resources are shared among multiple operating systems and applications.

In recent years, we moved to running databases in the cloud. And we no longer use a single database instance to store the data. Modern database systems are spread across multiple computers or nodes, which work together to store, manage and access the data.

This post is about distributed database architecture. We’ll cover what a distributed database is, what types exist, their benefits and drawbacks and how to design one.

What Is a Distributed Database?

As stated above, a distributed database is a database design that comprises several nodes working together. A node is basically a computing instance (it can also be a virtual machine or a container) that’s running the database. Depending on the design, each node holds either a full copy of the data or a portion of it, and the nodes communicate with each other to keep that data consistent.

Distributed databases offer many benefits over traditional single-server databases, including improved scalability, availability, performance and fault tolerance.

Why Switch from a Single Node to a Multinode Setup?

In the past, when data was measured in megabytes and database users were measured in dozens, a single database node could have done the work. A typical scenario for this kind of architecture was hosting the database on an on-premises mainframe machine. Developers connected to the database, ran queries, received the output, then disconnected. A single system administrator or a database administrator took care of the system in terms of availability, performance and upgrades.

Take Netflix as an example. It has a modern database architecture. Hundreds of millions of users all over the world use the application from different devices. Millions use the system at the same time. It should be available 24/7.

In this scenario, Netflix couldn’t possibly rely on a single computer running a single database application. If it goes down, millions of users will suffer a service disruption. In addition, storing all the data in one place is neither economically beneficial nor practical.

Imagine saving all the user data in one database instance running on a single server. The database backend should grow automatically as more subscribers join the service. Thus, a single on-premises database is simply not practical in terms of availability, scalability and fault tolerance.

Benefits of a Distributed Database Architecture

As mentioned above, distributed databases offer many benefits over traditional single-server databases, including improved scalability, availability, performance and fault tolerance.

Scalability

Compared to a single database, which can only scale vertically, distributed databases can scale horizontally. In other words, if you have a single database, the only way to scale it to handle more load is to add more resources, such as CPU and RAM, to that one machine. With a distributed database, you can simply add more nodes.

Availability and Fault Tolerance

If you only have one database and the database goes down, the application will go down with it. But with a distributed database, losing a node won’t affect the whole application, and the service will continue to function.

Data Security

You can split data across multiple nodes. Therefore, if a node is breached, most of the application’s data will remain secure. The same goes for data corruption: if a node’s data is corrupted due to a server or software error, it won’t affect the other nodes.

Reduced Network Traffic

Distributed databases can reduce network traffic by storing data closer to where it will be used, reducing the need to transmit data over the network.

Drawbacks of Distributed Databases

Designing and implementing a single database instance is much easier than designing and implementing a distributed database architecture. The same applies to monitoring, troubleshooting, maintaining and upgrading. A distributed database requires thorough planning, the right database vendor, the right architecture and so forth.

In addition to the increased complexity, there’s also higher cost as it often requires more hardware, software and skilled personnel. Lastly, there are consistency and coordination issues. Ensuring consistency across all nodes in a distributed database can be challenging, especially in systems with high concurrency or large amounts of data.

Types of Distributed Database Architecture

There are several types of distributed database architectures. Each has its own strengths and weaknesses, and the choice of architecture depends on the application’s specific needs.

Master-Slave Replication

In master-slave architecture, there’s a single primary database that manages all write operations while one or more slave databases replicate the data from the master for read operations. So all insert operations go to one node, and read operations are distributed across nodes. This setup is ideal for read-intensive applications.
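
As a brief illustration of how an application works with this pattern, here is a hedged sketch using MongoDB replica sets (a primary/secondary design) and the pymongo driver; the host names, replica set name, database and collection are placeholders.

from pymongo import MongoClient, ReadPreference

client = MongoClient(
    "mongodb://node1:27017,node2:27017,node3:27017/?replicaSet=rs0",
    read_preference=ReadPreference.SECONDARY_PREFERRED,
)
db = client.appdb

# Writes are always routed to the primary node.
db.orders.insert_one({"order_id": 1, "total": 42.0})

# Reads may be served by a secondary (replica) when one is available,
# which can lag slightly behind the primary.
doc = db.orders.find_one({"order_id": 1})
print(doc)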

Multi-Master Replication

With multi-master replication, every node acts as a master: each one accepts both reads and writes, and changes are replicated among the masters.

Shared-Nothing Architecture

In a shared-nothing architecture, nothing is shared between nodes; instead, data is sharded so that each node is responsible for only some of the data. Data is essentially split across nodes, and each node handles both reads and writes for its portion.


Federated Database Architecture

In a federated database architecture, there are several independent databases (and even several database types) organized as one meta-database. Basically, what you have here is a unified virtual database that you can query. The queries are distributed internally by the virtual database manager.

Examples of Distributed Databases

There are many examples and vendors that provide database solutions that work and that you can deploy as a distributed architecture. The following are the most popular:

  1. MongoDB, a popular NoSQL document database that you can distribute across multiple servers. It stores data in collections rather than tables and in documents rather than rows.
  2. Apache Cassandra, a highly scalable, distributed database system that’s designed for managing large volumes of structured and unstructured data across multiple data centers.
  3. Amazon DynamoDB, a fully managed NoSQL database service.

Choosing and Designing Your Distributed Database Architecture

When it’s time to choose which database architecture you should use for your organization or application, there are several things to consider. There are no right or wrong answers here. Each architecture has its use cases, so you should choose an architecture that best fits yours. Consider (among other factors) data partitioning, replication and consistency. In more detail, here are some of the steps that you should take:

  1. Identify the data that needs to be stored and accessed in the distributed database. This will help determine the amount of storage, schema design and so forth.
  2. Determine your data partitioning strategy. Decide on the strategy for partitioning across multiple nodes.
  3. Choose your replication strategy. You can choose between master-slave, multi-master or something else.
  4. Decide on a consistency model. Choose whether your data must be strongly consistent across nodes or whether eventual consistency is acceptable.

This is of course not an exhaustive list. You’ll also need to enlist an experienced architect.

Conclusion

Like any other technology, distributed databases have their advantages and drawbacks. However, for modern use cases, their advantages outweigh the drawbacks. There are several types of distributed database architecture, and you should only choose the one that best fits your needs after careful consideration.

The post Distributed Database Architecture: What Is It? appeared first on The New Stack.

How Open Source Has Turned the Tables on Enterprise Software https://thenewstack.io/how-open-source-has-turned-the-tables-on-enterprise-software/ Mon, 17 Apr 2023 17:00:40 +0000 https://thenewstack.io/?p=22704578

In the popular book “Working in Public: The Making and Maintenance of Open Source Software,” author Nadia Eghbal created a new taxonomy for open source projects based on user growth and contributor growth: federations (high user and contributor growth), stadiums (high user growth but few contributors), clubs (low user growth but high contributor growth) and toys (low on both).

The low user-growth solutions are often niche. The high user-growth open source solutions often invite competition if it isn’t already there. Federations like Linux Foundation and the Cloud Native Computing Foundation are the pinnacle of open source that support an entire ecosystem — like Linux and Kubernetes themselves.

Stadiums are projects where a large community enthusiastically watches a few contributors do their thing — much like how you watch your favorite sports team in your favorite stadium.  These stadiums are often part of the ecosystem built by federations.

It is heartening that these high-growth categories are no longer exclusive to commercial solutions alone. For many years, Microsoft protected its turf from Linux by making its most popular products incompatible with Linux. Apple built a similar ecosystem by not playing nice with the alternatives. Of course, this strategy is only effective if the commercial solution is the market leader.

Open Source Is Now the Default Choice

In the world of cloud native and Kubernetes, open source solutions are the incumbent, the default choice for the customer. Here, the tables have turned, and the commercial solutions now carry the obligation to be compatible with open source projects or face many perils.

Let’s examine why enterprises make the switch from open source and what they must consider in an open source-dominated ecosystem like Kubernetes. Open source adoption has long been a “time vs. money” decision for enterprises. You can choose to spend money to save time or spend time to save money. With flourishing open source communities, does this primary hypothesis still hold? Do people pick open source based on time savings alone?

Here are a few other important factors that enterprises consider:

Vendor and Community Relationships

The most common reason to pick a commercial solution need not be technology but its relationships. Very successful large vendors offer a portfolio of products and build relationships with their customers and emerge as the “one throat to choke,” which is essentially based on dependability.

This is a factor that the Linux Foundation, several cloud vendors and Day 2 Kubernetes management platforms are looking to overcome by building a cohesive ecosystem. When you choose based on vendor relationships, it is key to either look for larger companies with a relevant portfolio or smaller vendors with a strong partnering track record.

Enterprise Fit and Usability

Not everyone buys the most cost-efficient car on the market. There are those who prefer comfort, performance, reliability, safety, maintenance and fuel type/efficiency. Open source software tends to have crowdsourced benefits when dealing with objective items such as performance and security, which applies to everyone. But the highly subjective challenges on usability and manageability mostly remain a challenge.

The community often ends up building good commuter car engines and fitting them with a cockpit that requires hundreds of hours of experience. If better fit and usability is one of your top concerns, you’ll get better bang for your buck if you buy into commercial open source solutions that build on top of the most popular open source solution to address enterprise requirements.

Scale and Growth

A commercial solution often treats its larger customers with care. In open source projects with high user growth, there is no differentiation between small and large customers. This creates an incentive to focus on features that affect the most users, and thus roadmaps become popularity contests.

An enterprise with more complex needs may feel left behind, forcing it to contribute or seek alternatives — either creating the necessary tooling or moving to a commercial solution. When selecting a solution, one must always begin with the end in mind. How big do you expect to get, and can your solution grow seamlessly?

Documentation

An often overlooked and underappreciated challenge is documentation. While we are making significant progress in setting, and adhering to, security standards, documentation standards have been all over the place. Some of the most successful open source technologies have grown adoption through solid documentation, because it is documentation, not code, that people read to understand and adopt a technology.

Promoting understanding is the key value of documentation. You can’t fix this problem overnight, but you can look for a commercial vendor that has a deep understanding of the open source technology in use. What is their experience in using and supporting that technology? Do they participate in the community and help?

The Curious Case of Velero

Now, let’s consider the case of the smooth-sailing Velero — an unchallenged open source backup tool for Kubernetes. Velero has over 50 million Docker Hub pulls — if you use a 1% conversion to real workloads, it implies over 500,000 clusters are being backed up.

In terms of community activity, there are over 2,500 issues closed and roughly 400 issues open, with barely 10 active contributors. There are over 3,000 users in the #velero-users channel of the Kubernetes Slack workspace, and yet fewer than 10 people participate in community calls each week. This is a clear case of a stadium project: lots of users, but a fairly centralized and small contributor community.

Velero is popular and serves an essential need, but why do people switch to a different Kubernetes backup product? This GigaOm Radar for Kubernetes Data Protection implies that there are at least 13 other solutions available.

VMware is the community sponsor for Velero, but it only includes commercial support for Velero if you have enterprise Tanzu licenses. If you have no need for Tanzu, then you need to look elsewhere for support, hence the 3,000 users in the #velero-users Slack channel.

Velero is functionally on par or superior to any other backup solution in terms of protecting Kubernetes workloads, but usability and scalability remain a challenge. Lack of policy-driven templates, user interface, monitoring and alerting, and guided recoveries are all key requirements of users that remain unaddressed.

Velero is also designed as a single cluster offering. It works great for backing up one Kubernetes cluster, but managing a multicluster estate is an issue. If you combine multiclusters with a multicloud strategy, managing backups becomes even more tricky and time-consuming.

Despite these unmet needs, Velero today maintains an advantage of almost two orders of magnitude over its nearest commercial competitor. This is a sign that Velero is becoming a standard, something its maintainers have also observed. Standardization increases the risk that any competing commercial data-protection solution you choose may end up as a niche solution or, worse, as abandonware.

Getting on Board Velero

The popularity and sustainability of Velero may feel contradictory to its suitability at scale, or lack thereof. Enterprises that are dealing with this quandary must ask themselves if there is a way to have their cake and eat it too. Are there solutions that solve the scale and usability challenges on top of the open source Velero engine? Can one subscribe to an enterprise service that uses Velero? Can I adopt such a solution without having to migrate? Are there exit/egress costs?

It is clear that commercial vendors in the Kubernetes and cloud native communities must embrace and extend open source solutions such as Velero or they will be “toying” with their future and sacrificing a seat at the table when solutions are being considered by enterprises. It is, however, great that open source communities like Velero are building a longer table rather than taller fences.

The post How Open Source Has Turned the Tables on Enterprise Software appeared first on The New Stack.

How Can We Help Our Colleagues Become More Data Literate? https://thenewstack.io/how-can-we-help-our-colleagues-become-more-data-literate/ Mon, 17 Apr 2023 15:16:28 +0000 https://thenewstack.io/?p=22705201

Data democratization — making data more accessible to all of an organization’s users — is predicted to be the top data-centric trend of 2023, according to research group TDWI. But is data democratization truly effective if users do not understand how to interpret the data they’re accessing?

Unfortunately, that is the case with a significant number of non-IT users. A recent study by Boston Consulting Group revealed that only 45% of respondents said that their company promotes data literacy among all employees, even though 73% expect the number of nontechnical data consumers to grow.

Being data literate is essential for everyone, however. By becoming data literate, application developers can build more intelligent applications, sales managers can better understand customer purchasing decisions and C-level executives can gauge how well their businesses are performing.

A data-literate organization also helps us, the data scientists. We don’t spend our days analyzing data and building models just for the sake of a paycheck; we want our work to be deployable, usable and beneficial to our colleagues and organizations.

So, how can we help others in our organizations become more data literate?

Double Down on Data Visualization

While we live in data, C-suite executives, sales managers, customer-service executives and others live in PowerPoint presentations. They understand visuals: pie and bar charts, infographics, demographic maps and the like.

To help our colleagues become more data literate, we must present data in a way that makes sense to them. Often, that means presenting findings in a highly visual manner that shows important information at a glance.

For example, at Red Hat we build models to help our customer-service representatives better understand the journeys our customers are taking with us. We have collected vast amounts of internal data culled from various sources throughout the organization, including sales, customer care, product management, product marketing and others.

Our data-science team analyzes the input and creates visualizations of the output in the form of user dashboards, reports and presentations — whatever visual format our colleagues are most comfortable using. They are not data scientists, nor do they want to be. Yet they have become data literate, not by being able to analyze data, but by being able to visualize and understand the intelligence behind the data.

Launch a Data-Literacy Education Program

Presenting data in a user-friendly way may not be enough. We need to ensure that users understand the importance of data to themselves and their organizations.

This can be done through data-literacy education programs. These programs can be modified based on the needs of different organizations, but education should be a consistent and foundational pillar.

Education should include regular and ongoing training on the value and importance of data literacy. Courses and modules can range from the basics (“What is data literacy and why is it important?”), to the advanced (“What is data storytelling?”), to the specific (“The value of data visualization tools and how to use them effectively”).

The curriculum should be developed and taught by people who regularly work with data. For example, I helped develop a data-literacy training program at Red Hat. I wrote several courses and answered students’ questions. It’s been a great opportunity for me to share my expertise and help others learn how understanding data can boost their own careers.

Practice Proper Data Governance

International laws like the General Data Protection Regulation (GDPR) and the growing prevalence of state and local data privacy laws have made data protection and governance critical business initiatives. It’s important that anyone who handles data manages that data responsibly, especially if the data is personal and sensitive in nature.

Still, it’s not always easy to know what can, cannot or should be shared and with whom. Regulations and laws are continually changing, and data is often incomplete and inaccurate. And while many organizations, including Red Hat, have teams dedicated to data sovereignty and good data governance, it will sometimes be up to the individual to determine how to use data in a given instance. In these instances, users should take a moment and consider several factors, including:

  • What does the data contain? Does the data contain personally identifiable or proprietary information?
  • Who will see the data? Will the data be accessible to all or a wide group of users? Should it be accessible to such a large number of people?
  • Is the data accurate? Is it good enough to share on GitHub or a similar repository?

Helping our colleagues understand how to ask and answer these questions and giving them the knowledge and tools they need to interpret data will help create a data-literate culture. That will benefit everyone, especially as data continues to be a driving force for better business outcomes.

The post How Can We Help Our Colleagues Become More Data Literate? appeared first on The New Stack.

How to Get Started with Data Streaming https://thenewstack.io/how-to-get-started-with-data-streaming/ Thu, 13 Apr 2023 15:10:12 +0000 https://thenewstack.io/?p=22705154

Streaming data is everywhere, and today’s developer needs to learn how to build systems and applications that can ingest, process and act on data that’s continuously generated in real time.

From millions of connected devices in our homes and businesses to IT infrastructure, developers are tapping into an endless number of data streams to solve operational challenges and build new products, which means the learning opportunities are endless too.

While the term “data streaming” can apply to a host of technologies such as RabbitMQ, Apache Storm and Apache Spark, one of the most widely adopted is Apache Kafka.

In the 12 years since this event-streaming platform was made open source, developers have used Kafka to build applications that transformed their respective categories.

Think Uber, Netflix or PayPal. Data-streaming developers within these kinds of organizations and across the open source community have helped build applications with up-to-the-minute location tracking, personalized content feeds and real-time payment transactions.

These real-time capabilities have become so embedded into our daily lives that we now take them for granted and expect them. But before bringing those capabilities to life, developers first had to understand the platform’s decoupled publish-subscribe messaging pattern and how to best take advantage of it.

Understand How Kafka Works to Explore New Use Cases

Apache Kafka can record, store, share and transform continuous streams of data in real time. Each time data is generated and sent to Kafka, this “event” or “message” is recorded in a sequential log through publish-subscribe messaging.

While that’s true of many traditional messaging brokers like Apache ActiveMQ, Kafka is designed for both persistent storage and horizontal scalability.

When client applications generate and publish messages to Kafka, they’re called producers. As the data is stored, it’s automatically organized into defined “topics” and partitioned across multiple nodes or brokers. Client applications acting as “consumers” can then subscribe to data streams from specific topics.
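
To make the producer and consumer roles concrete, here is a minimal sketch using the confluent-kafka Python client against a local broker; the topic name, group ID and payload are invented for the example.

from confluent_kafka import Producer, Consumer

# Producer: publish an event to a topic.
producer = Producer({"bootstrap.servers": "localhost:9092"})
producer.produce("sensor-readings", key="device-1", value='{"temp_c": 21.5}')
producer.flush()

# Consumer: subscribe to the same topic and read events independently.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "demo-readers",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["sensor-readings"])
msg = consumer.poll(10.0)
if msg is not None and msg.error() is None:
    print(msg.key(), msg.value())
consumer.close()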

These functionalities make Kafka ideal for use cases that require real-time ingestion and processing of large volumes of data, such as logistics, retail inventory management, IT monitoring and threat detection.

Because the two types of clients — publishers and consumers — operate independently, organizations can use Kafka to decentralize their data management model. This decentralization has the potential to free Kafka users from information silos, giving developers and other data-streaming practitioners access to shareable streams of data from across their organizations.

Technical Skills Every Data-Streaming Developer Needs

At the heart of it, a developer’s job is to solve problems with code. Kafka simply provides a new platform for solving user and business challenges. In some ways, the approach to problem-solving that data streaming requires is simpler than traditional object-oriented programming. But that doesn’t mean it won’t take time and effort to learn the basics and then master them.

In many Kafka use cases, downstream applications and systems depend on the data streams to which they’re subscribed to initiate processes, screen for changes in the world around us and trigger planned reactions to specific scenarios.

But the only way Kafka can enable these possibilities is if developers can bring data in and out of the platform. That’s not possible without the broader ecosystem of tools like Kafka Connect, Kafka Streams and ksqlDB.

For developers first learning how to use the data-streaming platform, Kafka Connect should be their initial focus. This data integration framework allows developers to connect Kafka with other systems, applications and databases that generate and store data through connectors.
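
As a rough sketch of what registering a source connector can look like, assuming a Connect worker on its default port (8083) with the Confluent JDBC source connector installed; the connector name, database URL and credentials are placeholders.

import requests

connector = {
    "name": "orders-postgres-source",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:postgresql://db:5432/shop",
        "connection.user": "kafka_connect",
        "connection.password": "change-me",
        "mode": "incrementing",                 # pull new rows by an incrementing column
        "incrementing.column.name": "id",
        "topic.prefix": "pg-",                  # rows land in topics such as pg-orders
        "tasks.max": "1",
    },
}

# Register the connector with the Connect worker's REST API.
resp = requests.post("http://localhost:8083/connectors", json=connector)
resp.raise_for_status()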

Once developers have mastered creating data pipelines with Kafka, they’ll be ready to begin exploring streaming processing, which unlocks a host of operational and analytics use cases and the ability to create reusable data products.

Exploring What You Can Build with Kafka

Stream processing means ingesting and processing streaming data, all in real time. Transforming data in flight not only allows applications to react to the most recent, relevant information, but often also allows you to turn that data into a more consumable, shareable format.

As a database purpose-built for stream processing, ksqlDB allows developers to build pipelines that transform data as it’s ingested, and push the resulting streaming data into new topics after processing. Multiple applications and systems can then consume the transformed data in real time.
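
As a hedged sketch, assuming a ksqlDB server on its default port (8088) and an already-declared orders stream; the stream names and columns are invented for illustration.

import requests

statement = """
    CREATE STREAM high_value_orders AS
        SELECT order_id, customer_id, total
        FROM orders
        WHERE total > 100
        EMIT CHANGES;
"""

# Submit the persistent query to ksqlDB's REST endpoint.
resp = requests.post(
    "http://localhost:8088/ksql",
    headers={"Content-Type": "application/vnd.ksql.v1+json; charset=utf-8"},
    json={"ksql": statement, "streamsProperties": {}},
)
resp.raise_for_status()
print(resp.json())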

One of the most common processing use cases is change data capture (CDC), a data integration approach that calculates the change or delta in data so that information can be acted on. Applications like Netflix, Airbnb and Uber have used CDC to synchronize data across multiple systems without impacting performance or accuracy.

Although CDC can be achieved in other ways — for example with Debezium or Apache Flink — many organizations use ksqlDB to further enrich and transform the delta generated with CDC, stream the transformed data in real time and enable multiple downstream use cases as a result.

On the other hand, Kafka Streams is a client library that simplifies the way developers write client-side applications that stream data to and consume data from Kafka.

By learning how to use both these tools alongside Kafka and its connectors, developers can go from building “dumb” pipelines to stream-processing pipelines that transform data in real time, making it ready for streaming applications to act on.

Choosing Your First Kafka Project

The most successful developers are often the ones who feel inspired by the possibilities of what they’re creating. It doesn’t need to be complex or “innovative”; it just needs to be something that you’ll look forward to achieving. And as you prepare to start your first Kafka project, whether you want to use streaming data from a video game, develop your plant-monitoring system or create a market screener to fully take advantage of this platform, you should write down:

  • The source of streaming data you want to use.
  • Two or three problems you can solve or use cases you can implement using your chosen data stream.
  • The name of a learning partner or mentor you can contact to brainstorm and debug your code out loud. (In a pinch, rubber duck debugging is always an option, but it pays to have someone help you unpack the thought process behind your real-time problem-solving.)

While developers can build their own connectors to bring data in and out of Kafka, there are a wealth of open source and managed connectors that they can take advantage of. Many of the most common databases and business systems like PostgreSQL, Oracle, Snowflake and MongoDB already have connectors available.

Developers learning Kafka at work need to learn how to build data pipelines with connectors to quickly bring the data they work with every day into Kafka clusters. Those learning Kafka on their own can also find publicly available streaming datasets through free APIs. As you ramp up, you should also:

  • Find a client library for your preferred language. While Java and Scala are most common, thanks to client libraries like Kafka Streams, there are still lots of options for using other programming languages such as C/C++ or Python.
  • Gain a deep understanding of why Kafka’s immutable data structures are useful.
  • Update the data modeling knowledge that you learned with relational databases so you can learn how to effectively use Schema Registry, Kafka’s distributed storage layer for schemas.
  • Brush up on your SQL syntax to prepare to use Kafka’s interactive SQL engine for stream processing, ksqlDB.

When you’re ready to start, create your first cluster, and then build an end-to-end pipeline with some simple data. Once you’ve learned to store data in Kafka and read it back — ideally using live, real-time data — you’ll be ready to begin exploring more complex use cases that leverage stream processing.

Set Up Yourself for Long-Term Success When Learning Kafka

As with any other developer skill set, the learning never really ends when it comes to Kafka. Even more important than the technologies and frameworks in the ecosystem, you need to learn to problem-solve with a data-streaming mindset. Instead of thinking of data as finite “sets of sets,” as they’re stored in relational databases, you’ll have to learn how to apply data stored as immutable, appending logs.

Connecting with other developers on the same journey — or ones who have been in your shoes before — allows you to learn from others’ mistakes and discover solutions you might never have considered.

Not only is Kafka one of the most active open source projects, but businesses in sectors like financial services, tech, logistics and manufacturing are doubling down on their investment in data streaming. With so many companies and industries standardizing on Kafka as the de facto solution for data streaming, there’s a robust community for newcomers to join and learn alongside.

Developers who invest time in learning how to solve these kinds of impactful use cases will have a wealth of job opportunities, which means more interesting problems to solve and space to grow their skills and careers.

The post How to Get Started with Data Streaming appeared first on The New Stack.

Why Cloud Data Replication Matters https://thenewstack.io/why-cloud-data-replication-matters/ Fri, 31 Mar 2023 16:26:35 +0000 https://thenewstack.io/?p=22704022

Modern applications require data, and that data usually needs to be globally available, easily accessible and served with reliable performance expectations. Today, much of the heavy lifting happens behind the scenes. Let’s look at why the cloud factors into the importance of data replication for business applications.

Building Redundancy Protects Data

What is data replication? Simply put, it is a set of processes to keep additional copies of data available for emergencies, backups or to meet performance requirements. Copies may be done in duplicate, triplicate or more depending on the potential risk of a failure or the geographic spread of an application’s user base.

These multiple pieces of data may be chopped up into smaller pieces and spread around a server, network, data center or continent. This ensures data is always available and performance is unfailing in a scalable way.

Is the Complexity Worth It?

There are many reasons for building applications that understand replication, with or without cloud support. These are basic topics that any developer has had to deal with, but they are even more important when applications go global and/or mobile. Then they need ways to keep data secure and located efficiently.

These particular areas are commonly discussed when talking about cloud data replication:

Availability

This refers to making sure all data is ready for use when requested, with the latest versions and updates. Availability is affected when concurrent sessions do not share or replicate their data effectively. By replicating the latest changes to other nodes or servers, it should be instantly available to users who are accessing those other nodes.

Keeping a master copy is important, but it is equally important to keep that copy as up to date as possible for all users. This also means keeping child nodes in sync with the master node so that every user sees current data.

Latency

Data replication helps reduce latency of applications by keeping copies of data close to the end user of the application. Modern cloud applications are built on top of different networks often located in geographic regions where their user base is most active. While the overhead of keeping copies synchronized and copied might seem extreme, the positive impact on the end-user experience cannot be overstated — they expect their data to be close by and ready for use. If local servers have to go around the globe to fetch their data, the outcome is high latency and poor user experience.

Fault Tolerance

Replication is especially important for backup and disaster management purposes, such as when a node goes down. Replicas that were synchronized can then help recover data on new nodes that may be added due to a recent failure. When a data infrastructure requires too much manual copying of data during a failure, there are bound to be issues.

Failover of broken resources can be automated more fully when there are multiple replicas available, especially in different geographic regions that may not be affected by a regional disaster. Applications that can leverage data replication can also take care to preserve user data; otherwise, they risk losing information when a device breaks or a data center is destroyed.

Some see data replication as something “nice to have,” but as you can see, it’s not only about backup and disaster management; it’s also about application performance. There are other benefits as well that you can find as part of enterprise disaster management and performance plans.

Making Replication Work for You

The backend systems of a data replication setup keep copies of data spread around and redundant. This requires multiple nodes, in the form of clusters, that can communicate internally to keep data aligned. When a new cluster, a new node or a new piece of data is added, it is then automatically synchronized with the other nodes.

But the application level also needs to understand how the replication works. While a form-based app might just want a set of database tables, it must also understand that the source database has replicas available. Applications must know how to synchronize the data they have just collected, as in a mobile app, so other users will have access.

The smaller pieces of data that are synchronized are often known as partitions. Different partitions go on different hardware — storage pools, racks, networks, data centers, continents, etc., so they are not all exposed to a single point of failure.

The potential for complexity is often the limiting factor for companies seeking to implement data replication. Having frontend and backend systems that handle it transparently is essential.

How Is Cloud Data Replication Different?

As you can see, data replication does not explicitly depend on using cloud resources. Enterprises have been using their internal networks for decades with some of the same benefits. But with the addition of cloud-based resources, the opportunity to have extremely high availability and performance is easier than ever.

Traditional data replication has now been extended beyond just replicating from a PC to a network or between two servers. Instead, applications can replicate to a global network of endpoints that serve multiple purposes.

Traditionally, replication was used to preserve data in case of a failure. For example, replicas could be copied to a node if there was a failure, but replicas could not be used directly by an application.

Cloud data replication extends the traditional approach by sending data to multiple cloud-based data services that stay in sync with one another.

Cloud-to-Cloud Data Replication

Today’s cloud services allow us to add yet another rung on this replication ladder, allowing replication between multiple clouds. This adds another layer of redundancy and reduces the risk of vendor lock-in. Hybrid cloud options also bring local enterprise data services into the mix with the cloud-based providers serving as redundant copies of a master system.

As you can imagine, there are multiple ways to diagram all these interconnections and layers of redundancy, and several common models are in use.

Building Your Backend Solution

Though the potential for an unbreakable data solution is more possible than ever, it can also become complicated quickly. Hybrid cloud-based architectures have to accommodate many edge cases and variables that make it challenging for developers to build on their own.

Ideally, your data management backend can already handle this for you. Systems must expose options in an easy-to-understand way so that architects and developers can have confidence and reduce risk.

For example, we built Couchbase from the ground up as a multinode, multicloud replication environment so you wouldn’t have to. Built-in options include easily adding/removing nodes, failing over broken nodes easily, connecting to cloud services, etc. This allows developers to select options and architectures they need for balancing availability and performance for their applications.
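
As a rough sketch of what this looks like from the application side, assuming the Couchbase Python SDK with 4.x-style imports; the connection string, credentials, bucket name and document are placeholders.

from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions

# Connect to any node; the SDK discovers the rest of the cluster topology.
cluster = Cluster(
    "couchbase://node1.example.com",
    ClusterOptions(PasswordAuthenticator("app_user", "app_password")),
)
collection = cluster.bucket("profiles").default_collection()

# A single upsert; the cluster replicates the document to other nodes,
# and XDCR (when configured) carries it to other data centers or clouds.
collection.upsert("user::1001", {"name": "Kim", "home_region": "eu-west"})
print(collection.get("user::1001").content_as[dict])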

Learning More

Couchbase’s cross datacenter replication (XDCR) technology enables organizations to deploy geo-distributed applications with high availability in any environment (on premises, public and private cloud, or hybrid cloud). XDCR offers data redundancy across sites to guard against catastrophic data-center failures. It also enables deployments for globally distributed applications.

Read our whitepaper, “High Availability and Disaster Recovery for Globally Distributed Data,” for more information on the various topologies and approaches that we recommend.

Ready to try the benefits of cloud data replication with your own applications? Get started with Couchbase Capella.

The post Why Cloud Data Replication Matters appeared first on The New Stack.

]]>
Best Practices to Build IoT Analytics https://thenewstack.io/best-practices-to-build-iot-analytics/ Mon, 27 Mar 2023 16:33:36 +0000 https://thenewstack.io/?p=22703560

Today, Internet of Things (IoT) data or sensor data is all around us. Industry analysts project the number of connected

The post Best Practices to Build IoT Analytics appeared first on The New Stack.

]]>

Today, Internet of Things (IoT) data or sensor data is all around us. Industry analysts project the number of connected devices worldwide to reach 30.9 billion units by 2025, up from 12.7 billion units in 2021.

When it comes to IoT data, keep in mind that it has special characteristics, which means we have to plan how to store and manage it to maintain the bottom line. Making the wrong choice on factors like storage and tooling can complicate data analysis and lead to increased costs.

A single IoT sensor sends, on average, a data point per second. That totals over 80,000 data points in a single day. And some sensors generate data every nanosecond, which significantly increases that daily total.

Most IoT use cases don’t just rely on a single sensor either. If you have several hundred sensors, all generating data at these rates, then we’re talking about a lot of data. You could have millions of data points in a single day to analyze, so you need to ensure that your system can handle time series workloads of this size. If your storage is inefficient, your queries will be slow to return, and if you don’t configure your analysis and visualization tools for this type of data, you’re in for a bad time.

In this article, I will go over six best practices to build efficient and scalable IoT analytics.

1. Start Your Storage Right

Virtually all IoT data is time series data. Therefore, consider storing your IoT data in a time series database because, as a purpose-built solution for unique time series workloads, it provides the best performance. The shape of IoT data generally contains the same four components. The first is simply the name of what you’re tracking. We can call that a measurement, and that may be temperature, pressure, device state or anything else. Next are tags. You may want to use tags to add context to your data. Think of tags as metadata for the actual values you’re collecting. The values themselves, which are typically numeric but don’t have to be, we can call fields. And the last component is a timestamp that indicates when the measurement occurred.
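
For illustration, here is what a single data point with those four components might look like, sketched as a plain Python dictionary (the measurement, tag and field names are invented):

# A hypothetical IoT data point broken into the four components
point = {
    "measurement": "temperature",                           # what we're tracking
    "tags": {"device": "sensor-1", "room": "greenhouse"},   # metadata for context
    "fields": {"value": 21.7},                              # the values collected
    "timestamp": "2023-03-27T16:33:36Z",                    # when it was recorded
}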

Knowing the shape and structure of our data makes it easier to work with when it’s in the database. So what is a time series database? It’s a database designed to store these data values (like metrics, events, logs and traces) and query them based on time. Compare this to a non-time series database, where you could query on an ID, a value type or a combination of the two. In a time series database, we query based entirely on time. As a result, you can easily see data from the past hour, the past 24 hours and any other interval for which you have data. A popular time series database is InfluxDB, which is available in both cloud and open source.
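
As a rough sketch of querying based on time, the snippet below reads the past hour of data using the influxdb-client Python library and a Flux range query; the URL, token, org and bucket names are placeholders, not values prescribed here:

from influxdb_client import InfluxDBClient

# Placeholder connection details for an InfluxDB 2.x instance
client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org")
query_api = client.query_api()

# Flux query: everything written to the bucket in the past hour
flux = 'from(bucket: "iot-data") |> range(start: -1h)'
for table in query_api.query(flux):
    for record in table.records:
        print(record.get_time(), record.get_measurement(), record.get_value())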

2. High-Volume Ingestion

Time series data workloads tend to be large, fast and constant. That means you need an efficient method to get your data into your database. For that we can look at a tool like Telegraf, an open source agent that collects time series metrics on a configurable schedule. It has more than 300 plugins available for popular time series data sources, including IoT devices and more general plugins like execd, which you can use with a variety of data sources.

Depending on the database you choose to work with, other data ingest options may include client libraries, which allow you to write data using a language of your choice. For instance, Python is a common option for this type of tool. It’s important that these client libraries come from your database source so you know they can handle the ingest stream.
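
For example, writing a point through a vendor-provided client library might look like the following sketch, again assuming the influxdb-client Python package and placeholder connection details:

from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org")
write_api = client.write_api(write_options=SYNCHRONOUS)

# Build a point from the four components and write it to a raw-data bucket
point = Point("temperature").tag("device", "sensor-1").field("value", 21.7)
write_api.write(bucket="iot-raw", record=point)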

3. Cleaning the Data

You have three options when it comes to cleaning your data: You can clean it before you store it, after it’s in your database or inside your analytics tools. Cleaning up data before storage can be as simple as having full control over the data you send to storage and dropping data you deem unnecessary. Oftentimes, however, the data you receive is proprietary, and you do not get to choose which values you receive.

For example, my light sensor sends extra device tags that I don’t need, and occasionally, if a light source is suddenly lost, it sends strange, erroneous values, like 0. For those cases, I need to clean up my data after storing it. In a database like InfluxDB, I can easily store my raw data in one data bucket and my cleaned data in another. Then I can use the clean data bucket to feed my analytics tools. There’s no need to worry about cleaning data in the tools, where the changes wouldn’t necessarily replicate back to the database. If you wait until the data hits your analytics tools to clean it, that can use more resources and affect performance.
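
Here is a minimal sketch of that raw-bucket-to-clean-bucket flow, dropping the erroneous zero readings at the application level; the bucket names, measurement name and zero-value rule are assumptions for illustration:

from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org")
query_api = client.query_api()
write_api = client.write_api(write_options=SYNCHRONOUS)

# Read the last day of raw light readings
flux = 'from(bucket: "iot-raw") |> range(start: -1d) |> filter(fn: (r) => r._measurement == "light")'
for table in query_api.query(flux):
    for record in table.records:
        if record.get_value() == 0:
            continue  # drop the strange zero values before they reach the clean bucket
        clean = (
            Point("light")
            .tag("device", record.values.get("device", "unknown"))
            .field(record.get_field(), record.get_value())
            .time(record.get_time())
        )
        write_api.write(bucket="iot-clean", record=clean)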

4. The Power of Downsampling

Cleaning and downsampling data are not the same. Downsampling is aggregating the data based on time. For example, dropping a device ID from your measurement is cleaning, while deriving the mean value for the last five minutes is downsampling. Downsampling is a powerful tool in that, like cleaning data, it can save you storage costs and make the data easier and faster to work with.

In some cases, you can downsample data before storing it in its permanent database, for example, if you know that you don’t need the fine-grained data from your IoT sensors. You can also use downsampling to compare data patterns, like finding the average temperature across the hours of the day on different days or devices. The most common use for downsampling is to aggregate old data.

You monitor your IoT devices in real time, but what do you do with old data once new data arrives? Downsampling takes high-granularity data and makes it less granular by applying means and other aggregate operations. This preserves the shape of your historical data so you can still do historical comparisons and anomaly detection while reducing storage space.
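
As a concrete, database-agnostic sketch, the pandas snippet below rolls a few hypothetical per-second readings up into five-minute means, which is exactly the kind of aggregation described above:

import pandas as pd

# A handful of hypothetical high-granularity readings
raw = pd.DataFrame(
    {"value": [21.7, 21.8, 21.6, 21.9]},
    index=pd.to_datetime([
        "2023-03-27 16:00:01", "2023-03-27 16:00:02",
        "2023-03-27 16:04:58", "2023-03-27 16:04:59",
    ]),
)

# Downsample: one mean value per five-minute window
print(raw.resample("5min").mean())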

5. Real-Time Monitoring

When it comes to analyzing your data, you can either compare it to historical data to find anomalies, or you can set parameters. Regardless of your monitoring style, it’s important to do so in real time so that you can use the incoming data to make quick decisions and take fast action. The primary approaches for real-time monitoring include using a built-in option in your database, real-time monitoring tools or a combo of the two.

Regardless of the approach you choose, it’s critical for queries to have quick response times and minimal lag because the longer it takes for your data to reach your tools, the less real time it becomes. Telegraf offers output plugins to various real-time monitoring solutions. Telegraf is configured to work with time series data and is optimized for InfluxDB. So if you want to optimize data transport, you might want to consider that combination.
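
As a toy illustration of the “set parameters” style of monitoring, the check below flags any reading that falls outside a configured range; the bounds and the alerting mechanism are placeholders:

# Hypothetical acceptable range for a temperature sensor
LOW, HIGH = 5.0, 35.0

def check_reading(device: str, value: float) -> None:
    # In a real system this would page someone or post to an alerting tool
    if value < LOW or value > HIGH:
        print(f"ALERT: {device} reported {value}, outside [{LOW}, {HIGH}]")

check_reading("sensor-1", 41.2)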

6. Historical Aggregation and Cold Storage

When your data is no longer relevant in real time, it’s common to continue to use it for historical data analysis. You might also want to store older data, whether raw or downsampled, in more efficient cold storage or a data lake. As great as a time series database is for ingesting and working with real-time data, it also needs to be a great place to store your data long term.

Some replication across locations is almost inevitable, but the more you can prevent that, the better, outside of backups, of course. In the near future, InfluxDB will offer a dedicated cold storage solution. In the meantime, you can always use Telegraf output plugins to send your data to other cold storage solutions.

When working with IoT data, it’s important to use the right tools, from storage to analytics to visualization. Selecting the tools that best fit your IoT data and workloads at the outset will make your job easier and faster in the long run.

The post Best Practices to Build IoT Analytics appeared first on The New Stack.

]]>
What Are Time Series Databases, and Why Do You Need Them? https://thenewstack.io/what-are-time-series-databases-and-why-do-you-need-them/ Mon, 27 Mar 2023 12:00:04 +0000 https://thenewstack.io/?p=22703410

The use of time series databases (TSDBs) has been prevalent in various industries for decades, particularly in finance and process-control

The post What Are Time Series Databases, and Why Do You Need Them? appeared first on The New Stack.

]]>

The use of time series databases (TSDBs) has been prevalent in various industries for decades, particularly in finance and process-control systems. However, if you think you’ve been hearing more about them lately, you’re probably right.

The emergence of the Internet of Things (IoT) has led to a surge in the amount of time series data generated, prompting the need for purpose-built TSDBs.

Additionally, as IT infrastructure continues to expand, monitoring data from various sources such as servers, network devices, and microservices generates massive amounts of time series data, further highlighting the need for modern TSDBs.

While legacy time series solutions exist, many suffer from outdated architectures, limited scalability and integration challenges with modern data-analysis tools. Consequently, a new generation of time series databases has emerged, with over 20 new TSDBs being released in the past decade, particularly open source solutions.

These new TSDBs feature modern architectures that enable distributed processing and horizontal scaling, with open APIs that facilitate integration with data analysis tools and flexible deployment options in the cloud or on-premises.

Ensuring data quality and accuracy is critical for making data-driven decisions. Time series data can be “noisy” and contain missing or corrupted values. Time series databases often provide tools for cleaning and filtering data, as well as methods for detecting anomalies and outliers.

Here, we’ll explore the growing popularity of time series databases, the challenges associated with legacy solutions, and the features and benefits of modern TSDBs that make them an ideal choice for handling and analyzing vast amounts of time series data.

What Is Time Series Data?

First, let’s take a step back. What is time series data?

Time series data is data that is characterized by its time-based nature, where each data point is associated with a timestamp that indicates when it was recorded. This is in contrast to other types of data (such as transactional or document-based), which may not have a timestamp or may not be organized in a time-based sequence.

As mentioned earlier, one of the main reasons why time series data is becoming more prevalent is the rise of IoT devices and other sources of streaming data. IoT devices, such as sensors and connected devices, often generate large volumes of time-stamped data that need to be collected and analyzed in real time.

For example, a smart building might collect data on temperature, humidity and occupancy, while a manufacturing plant might collect data on machine performance and product quality.

Similarly, cloud computing and big data technologies have made it easier to store and process large volumes of time series data, enabling organizations to extract insights and make data-driven decisions more quickly.

Another factor contributing to the prevalence of time series data is the increasing use of machine learning and artificial intelligence, which often rely on time series data to make predictions and detect anomalies.

For example, a machine learning model might be trained on time series data from a sensor network to predict when equipment is likely to fail or to detect when environmental conditions are outside of normal ranges.

How Is Time Series Data Used?

There are many applications and industries that commonly use time series data, including:

IoT and smart buildings. Smart buildings use sensors to collect time series data on environmental factors such as temperature, humidity and occupancy, as well as energy usage and equipment performance. This data is used to optimize building operations, reduce energy costs and improve occupant comfort.

Manufacturing. Manufacturing plants collect time series data on machine performance, product quality and supply chain logistics to improve efficiency, reduce waste and ensure quality control.

Retail. Retailers use time series data to track sales, inventory levels and customer behavior. This data is used to optimize pricing, inventory management and marketing strategies.

Telecommunications. Telecommunications companies use time series data to monitor network performance, identify issues and optimize capacity. This data includes information on call volume, network traffic and equipment performance.

Finance. Financial trading firms use time series data to track stock prices, trading volumes and other financial metrics. This data is used to make investment decisions, monitor market trends and develop trading algorithms.

Health care. Health care providers use time series data to monitor patient vitals, track disease outbreaks and identify trends in patient outcomes. This data is used to develop treatment plans, improve patient care and support public health initiatives.

Transportation and logistics. Transportation and logistics companies use time series data to track vehicle and shipment locations, monitor supply chain performance and optimize delivery routes.

Energy. The energy industry relies on time series data to monitor and optimize power generation, transmission and distribution. This data includes information on energy consumption, weather patterns and equipment performance.

In each of these industries, time series data is critical for real-time monitoring and decision-making. It allows organizations to detect anomalies, optimize processes and make data-driven decisions that improve efficiency, reduce costs and increase revenue.

Time Series vs. Traditional Databases

Applications that collect and analyze time series data face specific challenges that are not encountered in traditional data storage and analytics.

Time series databases offer several advantages over traditional databases and other storage solutions when it comes to handling time series data. Here are some of the key advantages:

High Write Throughput

Time series databases are designed to handle large volumes of incoming data points that arrive at a high frequency. They are optimized for high write throughput, which means they can ingest and store large amounts of data quickly and efficiently, in real time. This is essential for applications that generate a lot of streaming data, such as IoT devices or log files.

For example, an IoT application that collects sensor data from thousands of devices every second requires a TSDB that can handle a high volume of writes and provide high availability to ensure data is not lost.

Traditional databases, on the other hand, are designed for transactional workloads and are not optimized for write-intensive workloads.

Automatic Downsampling and Compression

Time series data is high-dimensional — meaning it has multiple attributes or dimensions, such as time, location, and sensor values. It can quickly consume large amounts of storage space. Storing and retrieving this data efficiently can require a significant amount of disk space and computational resources.

TSDBs often use downsampling and compression techniques to reduce storage requirements, providing more efficient storage without sacrificing accuracy. This makes it more cost-effective to store large volumes of time series data.

Downsampling involves aggregating multiple data points into a single data point at a coarser granularity, while compression involves reducing the size of the data by removing redundant information. These techniques help to reduce storage costs and improve query performance, as less data needs to be queried.

Specialized Query Languages

TSDBs often offer specialized query languages that are optimized for time series data.

These languages, such as InfluxQL, support common time-based operations such as windowing, filtering, and aggregation, and can perform these operations efficiently on large datasets. This makes it easier to extract insights from time series data and perform real-time analytics.
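
For instance, an InfluxQL statement that combines filtering, time-based windowing and aggregation might look like the sketch below; the measurement and tag names are invented for illustration:

# Illustrative InfluxQL: filter by tag and time, window into hourly buckets, aggregate
query = """
SELECT MEAN("value")
FROM "temperature"
WHERE "device" = 'sensor-1' AND time > now() - 7d
GROUP BY time(1h)
"""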

Time-Based Indexing

Time series databases use time-based indexing to organize data based on timestamps, which allows for fast and efficient retrieval of data based on time ranges. This is important for applications that need to respond to events as they happen.

For instance: A financial trading firm needs to be able to respond to changes in the market quickly to take advantage of opportunities. This requires a time series database that can provide fast query response times and support real-time analytics.

Traditional databases may not have this level of indexing, which can make querying time-based data slower and less efficient.

Data Retention Policies

Time series databases offer features for managing data retention policies, such as automatically deleting or archiving old data after a certain period. This is important for managing storage costs and ensuring that data is kept only for as long as it is necessary.

Scalability and High Availability

Scalability and high availability are critical for handling large volumes of time series data. TSDBs are designed to scale horizontally across multiple nodes, which enables users to handle increasing data volumes and user demands. Along with automatic load balancing and failover, these features are critical for supporting high-performance, real-time analytics on time series data.

TSDBs can also provide high availability, which ensures that data is always accessible and not lost due to hardware failures or other issues.

5 Popular Open Source Time Series Databases

There are several open source TSDBs available in the market today. These databases offer flexibility, scalability, and cost-effectiveness, making them an attractive option for organizations looking to handle and analyze vast amounts of time series data.

These open source TSDBs are constantly evolving and improving, and each has its own strengths and weaknesses. Choosing the right one depends on the specific use case, data requirements and your organization’s IT infrastructure.

Here are five popular open source TSDBs:

Prometheus

Prometheus is an open source monitoring system with a time series database for storing and querying metrics data. It is designed to work well with other cloud native technologies, such as Kubernetes and Docker, and features a powerful query language called PromQL.

InfluxDB

InfluxDB’s unified architecture, which combines APIs for storing and querying data, background processing for extract, transform and load (ETL) and monitoring, user dashboards and data visualization, makes it a powerful and versatile tool for handling time series data.

With Kapacitor for background processing and Chronograf for UI, InfluxDB, developed by InfluxData, offers a comprehensive solution that can meet the needs of organizations with complex data workflows.

Additionally, InfluxDB’s open source nature allows for customization and integration with other tools, making it a flexible solution that can adapt to changing business needs.

TDengine

This open source TSDB is designed to handle large-scale time series data generated by IoT devices and industrial applications, including connected cars. Its architecture is optimized for real-time data ingestion, processing and monitoring, allowing it to efficiently handle data sets of terabyte and even petabyte-scale per day.

TDengine’s cloud native architecture makes it highly scalable and flexible, making it an appropriate choice for organizations that require a high-performance, reliable time series database for their IoT applications.

TimescaleDB

TimescaleDB is a powerful open source database designed to handle time series data at scale while making SQL scalable. It builds on PostgreSQL and is packaged as a PostgreSQL extension, providing automatic partitioning across time and space based on a partitioning key.

One of the key benefits of TimescaleDB is its fast ingest capabilities, which make it well-suited for the ingestion of large volumes of time series data. Additionally, TimescaleDB supports complex queries, making it easier to perform advanced data analysis.

QuestDB

QuestDB is an open source TSDB that provides high throughput ingestion and fast SQL queries with operational simplicity. QuestDB supports schema-agnostic ingestion, which means that it can handle data from multiple sources, including the InfluxDB line protocol, PostgreSQL wire protocol, and a REST API for bulk imports and exports.

One of the key benefits of QuestDB is its performance. It is designed to handle large volumes of time series data with low latency and high throughput, making it well-suited for use cases such as financial market data, application metrics, sensor data, real-time analytics, dashboards, and infrastructure monitoring.

Conclusion

Selecting an appropriate time series database for a specific use case requires careful consideration of the data types supported by the database. While some use cases may only require numeric data types, many other scenarios, such as IoT and connected cars, may require Boolean, integer, and string data types, among others.

Choosing a database that supports multiple data types can provide benefits such as enhanced compression ratios and reduced storage costs. However, it is important to also consider the additional complexity and costs associated with using such a database, including data validation, transformation and normalization.

Ultimately, the right TSDB can enable real-time data-driven insights and improve decision-making, while the wrong one can limit an organization’s ability to extract value from its data. As time marches on and the volume of time series data continues to grow, selecting a TSDB that is suited to your organization’s needs will become increasingly crucial.

The post What Are Time Series Databases, and Why Do You Need Them? appeared first on The New Stack.

]]>
The Economics of Database Operations https://thenewstack.io/the-economics-of-database-operations/ Thu, 23 Mar 2023 15:25:50 +0000 https://thenewstack.io/?p=22703253

Rising costs and uncertain economic times are causing many organizations to look for ways to do more with less. Databases

The post The Economics of Database Operations appeared first on The New Stack.

]]>

Rising costs and uncertain economic times are causing many organizations to look for ways to do more with less. Databases are no exception. Fortunately, opportunities exist to increase efficiency and save money by moving to a document database and practicing appropriate data modeling techniques.

Document databases save companies money in two ways:

  1. Object-centric, cross-language SDKs and schema flexibility mean developers can create and iterate production code faster, lowering development costs.
  2. Less hardware is necessary for a given transaction throughput, significantly reducing operational costs.

Developer Efficiency

All modern development uses the concept of objects. Objects define a set of related values and methods for reading, modifying and deriving results from those values. Customers, invoices and train timetables are all examples of objects. Objects, like all program variables, are transient and so must be made durable by persisting them to disk storage.

We no longer manually serialize objects into local files the way Windows desktop developers did in the 1990s. Nowadays data is stored not on the computer running our application, but in a central place accessed by multiple applications or instances of an application. This shared access means not only do we need to read and write data efficiently over a network, but also implement mechanisms to allow concurrent changes to that data without one process overwriting another’s changes.

Relational databases predate the widespread use and implementation of object-oriented programming. In a relational database, data structures are mathematical tables of values. Interaction with the data happens via a specialized language, SQL, that has evolved in the past 40 years to provide all sorts of interaction with the stored data: filtering and reshaping it, converting it from its flat, deduplicated interrelated model into the tabular, re-duplicated, joined results presented to applications. Data is then painfully converted from these rows of redundant values back into the objects the program requires.

Doing this requires a remarkable amount of developer effort, skill and domain knowledge. The developer has to understand the relationships between the tables. They need to know how to retrieve disparate sets of information, and then rebuild their data objects from these rows of data. There is an assumption that developers learn this before entering the world of work and can just do it. But this is simply untrue. Even when a developer has had some formal education in SQL, it’s unlikely that the developer will know how to write nontrivial examples efficiently.

Document databases start with the idea of persisting objects. They allow you to persist a strongly typed object to the database with very little code or transformation, and to filter, reshape and aggregate results using exemplar objects, not by trying to express a description in the broken English that is SQL.

Imagine we want to store a customer object where customers have an array of some repeating attribute, in this case, addresses. Addresses here are weak entities not shared between customers. Here is the code in C#/Java-like pseudocode:

class Address : Object {
  
   Integer number;
   String street, town, type;

   Address(number, street, town, type) {
       this.number = number
       this.street = street
       this.town = town
       this.type = type
   }

   //Getters and setters or properties as required
}
class Customer :  Object {
   GUID customerId;
   String name, email;
   Array < Address > addresses;

   Customer(id, name, email) {
       this.name = name;
       this.email = email;
       this.customerId = id
       this.addresses = new Array < Address > ()
   }
   //Getters and setters or properties as required
}
  
Customer newCustomer = new Customer(new GUID(),
   "Sally Smith", "sallyport@piratesrule.com")
  
Address home = new Address(62, 'Swallows Lane', 'Freeport', 'home')
newCustomer.addresses.push(home)


To store this customer object in a relational database management system (RDBMS) and then retrieve all the customers in a given location, we need the following code or something similar:

//Connect
RDBMSClient rdbms = new RDBMSClient(CONNECTION_STRING)
rdbms.setAutoCommit(false);


// Add a customer

insertAddressSQL = "INSERT INTO Address (number,street,town,type,customerId) values(?,?,?,?,?)"
preparedSQL = rdbms.prepareStatement(insertAddressSQL)
for (Address address of newCustomer.addresses) {
   preparedSQL.setInt(1, address.number)
   preparedSQL.setString(2, address.street)
   preparedSQL.setString(3, address.town)
   preparedSQL.setString(4, address.type)
   preparedSQL.setObject(5, newCustomer.customerId)
   preparedSQL.executeUpdate()
}

insertCustomerSQL = "INSERT INTO Customer (name,email,customerId) values(?,?,?)"
preparedSQL = rdbms.prepareStatement(insertCustomerSQL)
preparedSQL.setString(1, newCustomer.name)
preparedSQL.setString(2, newCustomer.email)
preparedSQL.setObject(3, newCustomer.customerId)
preparedSQL.executeUpdate()
rdbms.commit()


//Find all the customers with an address in freeport


freeportQuery = "SELECT ct.*, ads.* FROM address ad
INNER JOIN address ads ON ad.customerId=ads.customerId AND ad.town=?
INNER JOIN customer ct ON ct.customerId = ad.customerId"
 
preparedSQL = rdbms.prepareStatement(freeportQuery)
preparedSQL.setString(1, 'Freeport')
ResultSet rs = preparedSQL.executeQuery()
String customerId = ""
Customer customer;

//Convert rows back to objects

while (rs.next()) {
   //A new customerId value means we have moved on to the next customer
   if (rs.getObject("ct.customerId").toString() != customerId) {
       if (customerId != "") { print(customer.email) }
       customerId = rs.getObject("ct.customerId").toString()
       customer = new Customer(rs.getObject("ct.customerId"),
                               rs.getString("ct.name"),
                               rs.getString("ct.email"))
   }
   customer.addresses.push(new Address(rs.getInteger("ads.number"),
                           rs.getString("ads.street"),
                           rs.getString("ads.town"),
                           rs.getString("ads.type")))
}
if (customerId != "") { print(customer.email) }


This code is not only verbose and increasingly complex as the depth or number of fields in your object grows, but adding a new field will require a slew of correlated changes.

By contrast, with a document database, your code would look like the following and would require no changes to the database interaction if you add new fields or depth to your object:

//Connect
mongodb = new MongoClient(CONNECTION_STRING)
customers = mongodb.getDatabase("shop").getCollection("customers",Customer.class)

//Add Sally with her addresses
customers.insertOne(newCustomer)

//Find all the customers with an address in freeport
freeportCustomer = new Customer()
freeportCustomer.set("addresses.town", "Freeport")

FindIterable < Customer > freeportCustomers = customers.find(freeportCustomer)
for (Customer customer : freeportCustomers) {
   print(customer.email) //These have the addresses populated too
}


When developers encounter a disconnect between the programming model (objects) and the storage model (rows), they’re quick to create an abstraction layer to hide it from their peers and their future selves. Code that automatically converts objects to tables and back again is called an object relational mapper or ORM. Unfortunately, ORMs tend to be language-specific, which locks development teams into that language, making it more difficult to use additional tools and technologies with the data.

Using an ORM doesn’t free you from the burden of SQL either when you want to perform a more complex operation. Also, since the underlying database is unaware of objects, an ORM usually cannot provide much efficiency in its database storage and processing.

Document databases like MongoDB persist the objects that developers are already familiar with so there’s no need for an abstraction layer like an ORM. And once you know how to use MongoDB in one language, you know how to use it in all of them and you never have to move from objects back to querying in pseudo-English SQL.

It’s also true that PostgreSQL and Oracle have a JSON data type, but you can’t use JSON to get away from SQL. JSON in an RDBMS is for unmanaged, unstructured data, a glorified string type with a horrible bolt-on query syntax. JSON is not for database structure. For that you need an actual document database.

Less Hardware for a Given Workload

A modern document database is very similar to an RDBMS internally, but unlike the normalized relational model where the schema dictates that all requests are treated equally, a document database optimizes that schema for a given workload at the expense of other workloads. The document model takes the idea of the index-organized table and clustered index to the next level by co-locating not only related rows as in the relational model, but all the data you are likely to want to use for a given task. It takes the idea that a repeating child attribute of a relation does not need to be in a separate table (and thus storage) if you have a first-order array type. Or in other terms, you can have a column type of “embedded table.”

This co-location or, as some call it, the implicit joining of weak entity tables reduces the costs of retrieving data from storage as often only a single cache or disk location needs to be read to return an object to the client or apply a filter to it.

Compare this with needing to identify, locate and read many rows to return the same data and the client-side hardware necessary to reconstruct an object from those rows, a cost so great that many developers will put a secondary, simpler key-value store in front of their primary database to act as a cache.

These developers know the primary database cannot reasonably meet the workload requirements alone. A document database requires no external cache in front of it to meet performance targets but can still perform all the tasks of RDBMS, just more efficiently.

How much more efficiently? I’ve gone through the steps of creating a test harness to determine how much efficiency and cost savings can be achieved by using a document database versus a standard relational database. In these tests, I sought to quantify the transactional throughput per dollar for a best-in-class cloud-hosted RDBMS versus a cloud-hosted document database, specifically MongoDB Atlas.

The use case I chose represents a common, real-world application where a set of data is updated frequently and read even more frequently: an implementation of the U.K. Vehicle Inspection (MOT) system and its public and private interfaces, using its own published data.

The tests revealed that create, update and read operations are considerably faster in MongoDB Atlas. Overall, on similarly specified server instances with a similar instance cost, MongoDB Atlas manages approximately 50% more transactions per second. This advantage grows as the relational structure becomes more complex and the joins become more expensive.

In addition to the basic instance costs, the hourly running costs of the relational database varied between 200% and 500% of the Atlas costs for these tests due to additional charges for disk utilization. The cost of hosting the system, highly available to meet a given performance target, was overall three to five times less expensive on Atlas. In the simplest terms, Atlas could push considerably more transactions per second per dollar.

Independent tests confirm the efficiency of the document model. Temenos, a Swiss software company whose products are used by the largest banks and financial institutions in the world, has been running benchmark tests for more than 15 years. In its most recent test, the company ran 74,000 transactions per second (TPS) through MongoDB Atlas.

The tests resulted in throughput per core that was up to four times better than a similar test from three years ago while using 20% less infrastructure. This test was performed using a production-grade benchmark architecture with configurations that reflected production systems, including nonfunctional requirements like high availability, security and private links.

During the test, MongoDB read 74,000 TPS with a response time of 1 millisecond, all while also ingesting another 24,000 TPS. Plus, since Temenos is using a document database, there’s no caching in the middle. All the queries run straight against the database.

Summary

Unless you intend to have a single database in your organization used by all applications, moving your workloads from relational to a document model would allow you to build more in less time with the same number of developers, and to spend considerably less running it in production. Your business has almost certainly adopted object-oriented programming. So why haven’t you started using object-oriented document databases?

The post The Economics of Database Operations appeared first on The New Stack.

]]>
Digital Anarchy: Abolishing State https://thenewstack.io/digital-anarchy-abolishing-state/ Thu, 23 Mar 2023 13:32:20 +0000 https://thenewstack.io/?p=22703234

Think about the high jump in track-and-field competitions. You run up, almost parallel to the bar, and jump toward the

The post Digital Anarchy: Abolishing State appeared first on The New Stack.

]]>

Think about the high jump in track-and-field competitions. You run up, almost parallel to the bar, and jump toward the bar, pushing off your outside leg. In the air, you turn so that you are traveling headfirst toward the bar but facing away from it. You get your head over the bar, then arch your back as your head starts to fall, keeping your hips traveling upward. As your hips fall, your knees stay over the bar, and as your knees start to drop, you flick your feet over the bar.

This method is called the Fosbury Flop, and it took the track-and-field world by storm after Dick Fosbury used that method to win gold at the 1968 Olympic Games.

Up until that point, high jumpers used jump techniques that scissored their legs over the bar, or other methods of feet-first jumps. These were predominant because the high jump originated before the advent of quality deep pads to land on, so they were designed so that the jumper could land on their feet. Once better landing surfaces were available, jumpers could use different, and revolutionary, techniques.

As with anything revolutionary, many a pearl was clutched by high-jump traditionalists, and I’m sure those deep cushions came in handy for the “but that’s not the way we’ve always done it” crowd to swoon onto.

In many ways, technologists have been jumping over ever-higher bars of performance and user experiences as we traverse technological breakthrough after breakthrough. Going back to the Dawn of the Computer Age (DotCA), computers were very large, expensive, delicate and hard to use. As such, they were a shared resource and were kept locked away in rooms requiring infinite space.

As networking capabilities developed, users could be in a different location from the more powerful computers (henceforth known as servers) and interface with them through terminal clients, which were very light-duty computers. This became the standard use pattern that is still the predominant method we have today.

Even our fancy multitiered, load-balanced, Kubernetes-based cloud native application architecture is still based on this premise: that large, powerful groups of computers were going to host code and data and run complex digital alchemy. All for what seems like just pictures of our cats.

However, something has happened since the DotCA to call into question the necessity of this basic architecture: Computers got cheaper, more powerful and more portable. Adobe did a comparison of a 1980s Cray supercomputer to an iPhone 12 showing the difference in power that we possess in the palms of our hands. Home computers, laptops, phones, smart devices, cameras, cars and even your home appliances all have powerful computers in them with gigabytes, and even terabytes, worth of storage and are connected to the internet.

The idea that compute needs to live in data centers is long gone, and we are seeing that in application development for these platforms, where the computing power of these clients provides a soft-landing place for code. The adoption of edge computing and fast cellular networks decentralizes compute even more, and that trend doesn’t look like it’s going to change.

Where this decentralization runs off the rails and where we are abruptly slammed back to 1965 is when we deal with state and other data storage.

Keeping state in applications requires a data store of some sort. Since the DotCA, state for networked apps was primarily kept on servers. That said, when client systems were able to have hard drives or removable media, you could keep state for some applications with the client.

Many of us who are longer in the tooth remember needing to have a specific floppy disk with your data that you could take with you to various computers and run your application. It was somewhat convenient because you only needed to have access to the application itself, and you kept your own data with you.

In modern times, client computers don’t have magnetic or optical removable storage, and for pretty much every practical purpose, they don’t need them. You can keep documents, photos, videos, cat memes, PDFs of your recipes and your risky selfies in cloud-based storage. They are accessible anywhere and, in theory, only accessible by the owner of those artifacts.

It’s very common to consolidate this storage into a single provider for ease of use and interoperability between devices within an ecosystem. You can access your files using biometric authentication from your specific authorized devices, and often with multifactor authentication. For example, while it’s not impossible to hack an iPhone to get someone’s files in iCloud, it’s not trivial, either.

For modern web and cloud native applications, this is far from the case. Each application will have its own data storage to keep its own state, in the form of flat files in object store, relational and nonrelational databases, caches and so on. While the backend services are standard (MySQL, PostgreSQL, Redis, Elasticsearch, Oracle, etc.), the structures and schemas for that data are usually unique to whatever application they’re serving.

Another thing to consider is how much of that data is redundant between the various applications. How many websites and applications have your email address, your name, your birthday, credit card information, your address and so on? Every application you fill that data out for is yet another copy of the same data in a different location.

This redundancy comes with increased risk — it’s yet another opportunity to have that data compromised. Currently, the grand total of digital storage for all humankind is somewhere in the 200 exabyte range, and while most of it is probably just memes stored in S3, think about how much of that data is redundant.

Another thing to consider is what is done with that data once it is written. Does it need to live on in perpetuity for compliance? Is it ever deleted? What happens to it when one company is acquired by another? Is it used to model artificial intelligence, which you agreed to in all those TOS/EULAs that you didn’t read? GitHub Copilot and OpenAI are already in court because they are emitting copyrighted code, and AI-powered illustrations are being called out for similar concerns by artists. How much control do we have over our data that we are providing to these services so they can keep state on their own?

So, with the client-side compute and storage capabilities, and the presence of edge computers to decentralize data as our deep cushions, what if we did something revolutionary and abandoned the notion that applications need to keep server-side state?

Let’s imagine an architecture where a person has their own state store. It can be on a local device, on a singular cloud-based store or even on an intentionally ephemeral storage platform. Applications are built to run either natively on a device or in the browser, and they access client-side data stores to preserve state. These applications would use a universal architecture and schema so they can read the existing data for state and write only what they need for their specific applications.

These client data stores would allow apps to only read data that the client allows and could be document/table/key encrypted such that no two applications could read each other’s data, and the client could revoke encryption keys as they like to prevent applications from reading any data.

Such an architecture would revolutionize application development and deployment practices since only static code would need to be served, and any dynamic execution would happen on the client. This aspect of it isn’t new, but these applications wouldn’t phone their own servers and object stores to read and record transactions; they would access the client’s data stores.

Obviously there are some use cases where this may not be the best idea (financial institutions and business intelligence tools come to mind) and the ways that you would design these applications would be dramatically different in some cases, but think of the possibilities. Clients have control over their data, in a single place. Access to that data can use biometric multifactor authentication. You won’t have multiple copies of the same information living in various big noisy rooms around the globe unless you choose to do so.

You don’t need large, powerful servers doing so much compute all the time because it’s done at the client side, so you have smaller, more efficient data centers. Just imagine the difference in power consumption if you’re not burning old plants to power computers to hold the 2,8175th copy of your answer to the question, “What street did you grow up on?” Infrastructure deployment looks dramatically different since state storage is not a driving concern. Even more server-side compute load can be borne at the edge and, if you’re good, can even become fairly compute-architecture agnostic.

Over the coming weeks and months, I’ll dig more into some of the details and aspects of what this future would look like, but I would like to challenge you, dear reader, to do the same. We have developed the deep cushions, and now it’s time to do something revolutionary in the way we get over higher bars for performance, security, data protection and sustainability in 2023 and beyond.

The post Digital Anarchy: Abolishing State appeared first on The New Stack.

]]>
ScyllaDB’s Incremental Changes: Just the Tip of the Iceberg https://thenewstack.io/scylladbs-incremental-changes-just-the-tip-of-the-iceberg/ Mon, 20 Mar 2023 16:29:14 +0000 https://thenewstack.io/?p=22702991

Is incremental change a bad thing? The answer, as with most things in life, is “it depends.” In the world

The post ScyllaDB’s Incremental Changes: Just the Tip of the Iceberg appeared first on The New Stack.

]]>

Is incremental change a bad thing? The answer, as with most things in life, is “it depends.” In the world of technology specifically, the balance between innovation and tried-and-true concepts and solutions seems to have tipped in favor of the former. Or at least, that’s the impression the headlines give. Good thing there’s more to life than headlines.

Innovation does not happen overnight, and is not applied overnight either. In most creative endeavors, teams work relentlessly for long periods until they are ready to share their achievements with the world. Then they go back to their garage and keep working until the next milestone is achieved. If we were to peek in the garage intermittently, we’d probably call what we’d see most of the time “incremental change.”

The ScyllaDB team works with their garage doors up and is not necessarily after making headlines. They believe that incremental change is nothing to shun if it leads to steady progress. Compared to the release of ScyllaDB 5.0 at ScyllaDB Summit 2022, “incremental change” could be the theme of ScyllaDB Summit 2023 in February. But this is just the tip of the iceberg, as there’s more than meets the eye here.

I caught up with ScyllaDB CEO and co-founder Dor Laor to discuss this.

Note: In addition to reading the article, you can hear the complete conversation in the accompanying podcast.

Data Is Going to the Cloud in Real Time, and so Is ScyllaDB

The ScyllaDB team has their ears tuned to what their clients are doing with their data. What I noted in 2022 was that data is going to the cloud in real time, and so is ScyllaDB 5.0. Following up, I wondered whether those trends had continued at the pace they set previously.

The answer, Laor confirmed, is simple: Absolutely yes. ScyllaDB Cloud, the company’s database-as-a-service, grew more than 100% year over year in 2022. In just three years since its introduction in 2019, ScyllaDB Cloud is now the major source of revenue for ScyllaDB, exceeding 50%.

“Everybody needs to have the core database, but the service is the easiest and safest way to consume it. This theme is very strong not just with ScyllaDB, but also across the board with other databases and sometimes beyond databases with other types of infrastructure. It makes lots of sense,” Laor noted.

Similarly, ScyllaDB’s support for real-time updates via its change data capture (CDC) feature is seeing lots of adoption. All CDC events go to a table that can be read like a regular table. Laor noted that this makes CDC easy to use, also in conjunction with the Kafka connector. Furthermore, CDC opens the door to another possibility: Using ScyllaDB not just as a database, but also as an alternative to Kafka.

“It’s not that ScyllaDB is a replacement for Kafka. But if you have a database plus Kafka stack, there are cases that instead of queuing stuff in Kafka and pushing them to the database, you can just also do the queuing within the database itself,” Laor said.

This is not because Kafka is bad per se. The motivation here is to reduce the number of moving parts in the infrastructure. Palo Alto Networks did this and others are following suit too. Numberly is another example in which ScyllaDB was used to replace both Kafka and Redis.

High Performance and Seamless Migration

Numberly was one of the many use cases presented in ScyllaDB Summit 2023. Others included the likes of Discord, Epic Games, Optimizely, ShareChat and Strava. Browsing through those, two key themes emerge: High performance and migration.

Migration is a typical adoption path for ScyllaDB, as Laor shared. Many users come to ScyllaDB from other databases in search of scalability. As ScyllaDB sees a lot of migrations, it offers support for two compatible APIs, one for Cassandra and one for DynamoDB. There are also several migration tools, such as the Spark Migrator, which scans the source database and writes to the target database. CDC may also help there.

While each migration has its own intricacies, when organizations like Discord or ShareChat migrate to ScyllaDB, it’s all about scale. Discord migrated trillions of messages. ShareChat migrated dozens of services. Things can get complicated and users will make their own choices. Some users rewrite their stack without keeping API compatibility, or even rewrite parts of their codebase in another programming language like Go or Rust.

Epic Games, Optimizely and Strava all presented high-performance use cases. Epic Games are the creators of the Unreal game engine. Its evolution reflects the way that modern software has evolved. Back in the day, using the Unreal game engine was as simple as downloading a single binary file. Nowadays, everything is distributed. Epic Games works with game makers by providing a reference architecture and a recommended stack, and makers choose how to consume it. ScyllaDB is used as a distributed cache in this stack, providing fast access over objects stored in AWS S3.

A Sea of Storage, Raft and Serverless

ScyllaDB increasingly is being used as a cache. For Numberly, it happened because ScyllaDB does not need a cache, which made Redis obsolete. For Epic Games, the need was to add a fast-serving layer on top of S3.

S3 works great and is elastic and economical, but if your application has stringent latency requirements, then you need a cache, Laor pointed out. This is something a lot of people in the industry are aware of, including ScyllaDB engineering. As Laor shared, there is an ongoing R&D effort in ScyllaDB to use S3 for storage too. As he put it:

“S3 is a sea of extremely cheap storage, but it’s also slow. If you can marry the two, S3 and fast storage, then you manage to break the relationship between compute and storage. That gives you lots of benefits, from extreme flexibility to lower TCO.”

This is a key project that ScyllaDB’s R&D is working on these days, but not the only one.

In the upcoming open source version 5.2, ScyllaDB will have consistent, transactional schema operations based on Raft. In the next release, 5.3, transactional topology changes will also be supported. Strong metadata consistency is essential for sophisticated users who programmatically scale the cluster and issue Data Definition Language (DDL) operations.

This paves the way toward changes in the way ScyllaDB shards data, making it more dynamic and leading to better load balancing.

Many of these efforts come together in the push toward serverless. A free trial based on serverless was made available at the ScyllaDB summit and will become generally available later this year.

P99, Rust and Beyond

Laor does not live in a ScyllaDB-only world. Being someone with a hypervisor and Linux Red Hat background, he appreciates the nuances of P99. P99 latency is the 99th latency percentile. This means 99% of requests will be faster than the given latency number, or that only 1% of the requests will be slower than your P99 latency.
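
Computing a P99 from a batch of observed latencies is a one-liner; the sample numbers below are made up:

import numpy as np

# Hypothetical request latencies in milliseconds
latencies_ms = [1.2, 0.9, 1.1, 0.8, 1.0, 14.5, 1.3, 0.7, 1.1, 0.9]

# 99% of requests completed faster than this value
print(f"P99 latency: {np.percentile(latencies_ms, 99):.2f} ms")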

P99 CONF is also the name of an event on all things performance, organized by the ScyllaDB team but certainly not limited to them. P99 CONF is for developers who obsess over P99 percentiles and high-performance, low-latency applications. It’s where developers can get down in the weeds, share their knowledge and focus on performance in real time.

Laor said that participants are encouraged to present databases, including competitors, as well as talk about operating systems development, programming languages, special libraries and more.

ScyllaDB’s original premise was to be a faster implementation of the Cassandra API, written in C++ as opposed to Java.

Although ScyllaDB has invested heavily in its C++ codebase and perfected it over time, Laor also gives Rust a lot of credit. So much, in fact, that he said it’s quite likely that if they were to start the implementation effort today, they would do it in Rust. Not so much for performance reasons, but more for the ease of use. In addition, many ScyllaDB users like Discord and Numberly are moving to Rust.

Even though a codebase that has stood the test of time is not something any wise developer would want to get rid of, ScyllaDB is embracing Rust too. Rust is the language of choice for the new generation of ScyllaDB’s drivers. As Laor explained, going forward the core of ScyllaDB’s drivers will be written in Rust. From that, other language-specific versions will be derived. ScyllaDB is also embracing Go and Wasm for specific parts of its codebase.

To come full circle, there’s a lot of incremental change going on. Perhaps if the garage door wasn’t up, and we only got to look at the finished product in carefully orchestrated demos, those changes would make a bigger impression. Apparently, that’s not what matters most to Laor and ScyllaDB.

The post ScyllaDB’s Incremental Changes: Just the Tip of the Iceberg appeared first on The New Stack.

]]>
Historical Data and Streaming: Friends, Not Foes https://thenewstack.io/historical-data-and-streaming-friends-not-foes/ Mon, 20 Mar 2023 11:00:32 +0000 https://thenewstack.io/?p=22702923

Real-time event streaming has become one of the most prominent tools for software engineers over the last decade. In Stack

The post Historical Data and Streaming: Friends, Not Foes appeared first on The New Stack.

]]>

Real-time event streaming has become one of the most prominent tools for software engineers over the last decade. In Stack Overflow’s 2022 Developer Survey, Apache Kafka, the de facto event-streaming platform, is ranked as one of the highest-paying tech skills and most-loved frameworks.

While it was obscure at its outset, there are now countless stories of companies using it at massive scale for use cases like gaming and ride-sharing, where latency must remain incredibly low. Because these examples are talked about the most, many people believe event streaming — also called data streaming — is only appropriate for use cases with demanding real-time requirements and not suitable for older, historical data. This thinking, however, is shortsighted and highlights a missed architectural opportunity.

Regardless of how fast your business needs to process data, streaming can make your software more understandable, more robust and less vulnerable to bugs — if it’s the right tool for the job. Here are three key factors to think about when you consider adding streaming to your architecture.

Factor 1: Understand Your Data’s Time/Value Curve

How valuable is your data? That’s a trick question. It depends on when the data point happened. The vast majority of data has a time/value curve. In general, data becomes less valuable the older it gets.

Now, older data hasn’t commonly been something people talk about in the same breath as streaming. Why? Until somewhat recently, most streaming platforms were created to have relatively small storage capacity. This made sense for their initial homes in bare-metal data centers but has become an unsound pattern since nearly everything has moved to the cloud. The cloud’s access to object storage provides near-limitless storage capacity.

Many streaming platforms integrate directly with those stores and carry through the same storage capacity improvements. This matters because it takes forced retention decisions out of the equation when it comes to streaming. You no longer need to decide how long you can keep data in a stream — you simply keep it as long as it makes sense.

One of the most exciting use cases for historical streams is backtesting online machine learning models. Teams often find that when they deploy a trained model to production, they need to change it in some way. But how can they be sure their new model works well? The very best outcome is to test it against all of the historical traffic, and because streaming is lossless, that is exactly what you get.
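
As a concrete sketch, a backtest can replay the full history of a topic by starting a fresh consumer group at the earliest offset. The example below uses the kafka-python client; the topic name, broker address and scoring function are illustrative placeholders, not details from the article:

```python
import json
from kafka import KafkaConsumer  # pip install kafka-python


def score(event):
    # Stand-in for the candidate model being backtested.
    return event.get("estimated_wait_min", 0) * 1.1


# A brand-new consumer group with auto_offset_reset="earliest" replays
# every retained event in order, from the very beginning of the topic.
consumer = KafkaConsumer(
    "ride-requests",                     # hypothetical topic name
    bootstrap_servers="localhost:9092",  # hypothetical broker address
    group_id="model-backtest-v2",        # new group => no committed offsets yet
    auto_offset_reset="earliest",
    enable_auto_commit=False,
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for record in consumer:  # iterates until the process is stopped
    event = record.value
    prediction = score(event)
    # Compare `prediction` with the outcome recorded in the historical event.
```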

If your data’s time/value relationship makes sense, streaming is a great way to get value out of both ends of the curve.

Factor 2: Decide on the Direction of Data Flow

In the old days of software engineering, many things were written with polling: periodic checks to see if something happened. For instance, you might periodically poll a database table to see if a row was added or changed. For a lot of reasons, this is a recipe for disaster: many things can have changed since the last time you checked, and you won’t know what all of those changes were.

Streaming’s superpower is that it forces you to think in terms of lossless, unidirectional dataflows instead of mutable, bidirectional request/response calls. This gives you a simple model for understanding how systems communicate, regardless of whether data is real time or historical. Instead of polling, you can listen for updates from a system and be guaranteed to see every change, in the order the changes occurred. To address the example above, change data capture (CDC) has become the de facto solution for listening to database changes.
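
To make the push model concrete, here is a minimal sketch of a handler for a Debezium-style change event (the envelope fields follow Debezium’s common format; the sample event and handler logic are illustrative assumptions):

```python
import json

# A Debezium-style change event envelope carries the operation type
# ("c" = create, "u" = update, "d" = delete) plus before/after row images.
raw_event = json.dumps({
    "op": "u",
    "before": {"id": 42, "status": "pending"},
    "after": {"id": 42, "status": "shipped"},
    "ts_ms": 1679313632000,
})


def handle_change(event_json):
    event = json.loads(event_json)
    op = event["op"]
    if op == "c":
        print("row inserted:", event["after"])
    elif op == "u":
        print("row updated:", event["before"], "->", event["after"])
    elif op == "d":
        print("row deleted:", event["before"])


# Events arrive in commit order, so every intermediate state is observed,
# unlike polling, which only sees whatever the row looks like right now.
handle_change(raw_event)
```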

When you think about whether streams are useful for your problem, set aside latency and ask yourself: Does my system benefit from this kind of push model? Are lossless updates important?

Factor 3: Pick an Expiration Strategy

Unbounded, historical streams are great, but there will always come a time when it makes sense to delete your data, perhaps due to GDPR compliance or changes to your business. How do you reconcile streaming’s key primitive — an immutable log of data — with deletion, a mutable operation?

There are two common ways to address this. The first is to implement expiry policies that let the system get rid of data after a certain time period, like a time-to-live (TTL). A variation on that is compaction, where older revisions of a record are purged over time and only the latest value for each key is retained.
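
In Kafka, for instance, both strategies are ordinary topic configurations. Here is a minimal sketch using the kafka-python admin client; the broker address, topic names and settings are illustrative, not prescriptive:

```python
from kafka.admin import KafkaAdminClient, NewTopic  # pip install kafka-python

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")  # hypothetical broker

admin.create_topics([
    # Time-based expiry: records older than 30 days become eligible for deletion.
    NewTopic(
        name="clickstream",
        num_partitions=3,
        replication_factor=1,
        topic_configs={"cleanup.policy": "delete",
                       "retention.ms": str(30 * 24 * 60 * 60 * 1000)},
    ),
    # Compaction: older revisions of each key are purged, the latest value is kept.
    NewTopic(
        name="customer-profiles",
        num_partitions=3,
        replication_factor=1,
        topic_configs={"cleanup.policy": "compact"},
    ),
])
```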

The second is a bit more sophisticated and uses encryption. An encrypted payload is only useful if you have the decryption key. In general, deleting a payload’s encryption key is seen as a mistake, but not if you want to prevent anyone from ever seeing that data again! In some systems, intentionally deleting encryption keys, and then later deleting the actual dataset, is a simple solution to taking data offline.
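
This pattern is often called crypto-shredding. Here is a minimal sketch with the Python cryptography package; the payload is made up, and a real system would keep per-user keys in a key-management service rather than in a local variable:

```python
from cryptography.fernet import Fernet  # pip install cryptography

# One key per user (or per record) makes targeted deletion possible.
user_key = Fernet.generate_key()
ciphertext = Fernet(user_key).encrypt(b'{"user_id": 42, "email": "a@example.com"}')

# The encrypted payload can live forever in an immutable stream...
print(Fernet(user_key).decrypt(ciphertext))  # readable while the key exists

# ...but once the key is destroyed, the payload is effectively gone,
# even though the bytes in the log never change.
user_key = None  # in practice: delete the key from the key-management service
```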

It’s hard to predict the future of software, but one constant is that there will always be new technologies on the horizon. When you consider streaming for your use case, it’s important to think about these key questions: Is a push model helpful? Is ordered access to older data useful? Is there a simple way to delete old data? If the answer is yes to these questions, you’re investing in streaming technology for the right reasons, and it’s hard to go wrong when you do that.

The post Historical Data and Streaming: Friends, Not Foes appeared first on The New Stack.

]]>
Bosch IoT and the Importance of Application-Driven Analytics https://thenewstack.io/bosch-iot-and-the-importance-of-application-driven-analytics/ Thu, 09 Mar 2023 16:30:27 +0000 https://thenewstack.io/?p=22702128

“Without data, it’s just trial and error.” So says Kai Hackbarth. He’s a senior technology evangelist at Bosch Global Software

The post Bosch IoT and the Importance of Application-Driven Analytics appeared first on The New Stack.

]]>

“Without data, it’s just trial and error.” So says Kai Hackbarth. He’s a senior technology evangelist at Bosch Global Software with over 22 years of industry experience in the Internet of Things. “Be it automotive, be it industrial, be it smart-home, we’ve done it all,” Hackbarth said. Except for tires. That’s his latest focus and technology challenge.

“Sounds maybe a bit simple,” Hackbarth said, “but if you think about it more deeply, [it’s really complex].”

Because, as it turns out, tires can collect many different pieces of data that tell a system a great deal about what’s going on with the car at any given moment.

“Pressure, temperature, accelerometer,” Hackbarth said. “And then you also have other data from the car that’s critical for safety and sustainability.”

But to be of any value, that data needs to be analyzed as close to the source as possible and in real time. Why?

“It’s safety-critical,” Hackbarth said. “If you send all the raw data to the cloud this consumes a lot of costs.”

Chief among those costs: Time.

In order to react to an issue, that data cannot be historical, because historical data about a tire that’s losing pressure or hydroplaning isn’t helpful to the applications inside the car that need to respond to these developments as they’re happening.

And thankfully for Hackbarth and his team at Bosch, it is now possible to bring real-time analytics that can handle enormous volumes of data directly into their applications.

Smarter Applications Start with Built-in Analytics

Traditionally, applications and analytics have existed as two separate classes of workloads with different requirements, for example, read and write access patterns, as well as concurrency and latency. As a result, businesses have usually deployed purpose-built data stores, including both databases for applications and data warehouses for analytics, and then piped or duplicated the data between them.

And that’s been fine when analytics don’t need to affect how an application responds in real time. But most customers expect applications to take intelligent actions in the moment, rather than after the fact.

The same principle applies to Bosch’s tire project. The applications inside the car that can autonomously brake when approaching another vehicle too fast, or slow down if the tire senses that it’s hydroplaning, also need to analyze all the data from all the sensors in real time.
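
As a toy illustration (entirely made up, not Bosch’s actual code), real-time analysis of a sensor stream can be as simple as checking each new reading against a rolling window the moment it arrives:

```python
from collections import deque

WINDOW = 10            # number of recent readings to keep
DROP_THRESHOLD = 0.25  # bar; a drop this fast within the window suggests a leak

recent = deque(maxlen=WINDOW)


def on_pressure_reading(bar):
    """Called for every new tire-pressure sample, as it arrives."""
    recent.append(bar)
    if len(recent) == WINDOW and recent[0] - bar > DROP_THRESHOLD:
        print(f"ALERT: pressure fell from {recent[0]:.2f} to {bar:.2f} bar")


# Simulated stream of readings: steady, then a rapid loss of pressure.
for sample in [2.4] * 10 + [2.35, 2.3, 2.2, 2.1]:
    on_pressure_reading(sample)
```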

This process of building real-time analytics into applications is known as “application-driven analytics.” And it’s how applications get smarter. Be they e-commerce apps on your phone, safety apps in your car or those that monitor rocket launches.

The question for many development teams, though, is how do you build this capability into your applications easily? For a long time, that’s been a difficult question to answer.

A Platform for Building Real-Time Analytics into Your Apps

“From my history [of doing this for] 22 years,” Hackbarth says, “we never had the capabilities to do this before.”

Previously, teams everywhere — not just at Bosch — used to have to do a lot of custom engineering work to do real-time analytics close to the source, including:

  • Stitching together multiple databases to handle different data structures (documents, tables, time series measurements, key values, graph), each accessed with its own unique query API.
  • Building ETL data pipelines to transform data into required analytics formats and tier it from the live database to lower-cost object storage.
  • Spinning up a federated query engine to work across each data tier, again using its own unique query API.
  • Integrating serverless functions to react to real-time data changes.
  • Standing up their own API layers to expose data to consuming applications.

All of which result in a multitude of operational and security models to deal with, a ton of data integration work and lots of data duplication.

But that’s no longer the case. There are now data platforms that bring together operational and analytical workloads into one, and that, in turn, allow you to bring live, operational data and real-time analysis together.

MongoDB Atlas — the platform with which I am most familiar (since I work at MongoDB) — allows developers to do this by providing an integrated set of data and application services that fit their workflows.

Developers can land data of any structure, index, query and analyze it in any way they want, and then archive it. And they can do all of this with a unified API and without having to build their own data pipelines or duplicate data.
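
As a small illustration of what that unified API can look like in practice, here is a sketch using PyMongo against an Atlas cluster. The connection string, database, collection and field names are placeholders, not anything from Bosch or MongoDB’s documentation:

```python
from datetime import datetime, timedelta
from pymongo import MongoClient  # pip install "pymongo[srv]"

# Placeholder connection string; substitute your own Atlas URI.
client = MongoClient("mongodb+srv://user:password@cluster0.example.mongodb.net")
readings = client["telemetry"]["tire_readings"]

# One aggregation pipeline runs the analytics where the operational data lives:
# average pressure and sample count per vehicle over the last hour.
one_hour_ago = datetime.utcnow() - timedelta(hours=1)
pipeline = [
    {"$match": {"recorded_at": {"$gte": one_hour_ago}}},
    {"$group": {
        "_id": "$vehicle_id",
        "avg_pressure": {"$avg": "$pressure_bar"},
        "samples": {"$sum": 1},
    }},
    {"$sort": {"avg_pressure": 1}},
]

for doc in readings.aggregate(pipeline):
    print(doc)
```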

This is the platform on which the Bosch team continues to build its solutions.

“Without data,” Hackbarth says, “it’s just trial and error.” But now with data and with a single platform to build real-time analytics into their applications, it’s something concrete, something responsive and something actionable. It’s something smart. Especially in the most critical of use cases that Bosch is working on. Like tires.

If you’d like to learn more about how to build application-driven analytics into your applications, I am going to be hosting a three-part livestream demo showing how to build application-driven analytical applications in MongoDB Atlas.

During this three-part series, we will build real-time analytical capabilities for a simulated system for managing rocket launches.

  • Part One will cover the basics: Building complex analytical queries using MongoDB’s aggregation framework and building visualizations using charts.
  • Part Two will highlight some additional Atlas capabilities that are often invaluable when building app-driven analytics: Atlas search, triggers and embedding Charts generated visualizations into an application UI.
  • Part Three will focus on how to use Atlas Data Federation, Atlas Data Lake and Atlas SQL Interface to perform analytics using large sets of historical data and federated queries across multiple data sources.

The first livestream is at 10 a.m. on March 15, and the subsequent streams are scheduled for March 22 and 29.

The post Bosch IoT and the Importance of Application-Driven Analytics appeared first on The New Stack.

]]>
How to Work with Containers in TrueNAS https://thenewstack.io/containers/how-to-work-with-containers-in-truenas/ Sat, 04 Mar 2023 14:00:24 +0000 https://thenewstack.io/?p=22701729

TrueNAS is a Network Attached Storage software you can deploy to your LAN or a third-party cloud host. But don’t

The post How to Work with Containers in TrueNAS appeared first on The New Stack.

]]>

TrueNAS is a Network Attached Storage software you can deploy to your LAN or a third-party cloud host. But don’t be fooled by the “NAS” part of the name, as this platform can do much more than just storage. In fact, there are a number of other features that can be added to or used by TrueNAS, such as virtual machines and even containers.

That’s right, you can work with Docker images in TrueNAS and deploy apps like Collabora, Nextcloud, and more. Even better, it’s much easier than you might think.

I’m going to walk you through the steps required to get TrueNAS ready to work with Docker images.

What You’ll Need

To work with Docker images and other apps, you’ll need TrueNAS up and running. You can deploy the platform as a virtual machine with VirtualBox, so long as you’ve enabled Nested VT-x/AMD-V support in the Processor tab of the System section in Settings for the VM.

Without that feature enabled, TrueNAS won’t be able to use virtual machines or Docker images. You’ll also need to have added at least two extra drives to the VirtualBox VM, in order to create a Pool for TrueNAS to use.

Adding New Drives to VirtualBox

This is actually easier than you might think because you don’t have to bother formatting the added drives (TrueNAS will take care of that for you).

To add a new drive to the VirtualBox VM, make sure the virtual machine is stopped (not paused). Select the VM from the VirtualBox left sidebar and click Settings. In the Storage section of the Settings window, select Controller: SATA and click the far right green + (Figure 1).

Figure 1: My VM is still running, so I can’t add the new drives.

The drive wizard will open. Make sure to create a drive with a fixed size; otherwise, it might cause problems with the pool (or TrueNAS won’t be able to create the pool successfully).

You must add two new drives to the VM in order for TrueNAS to create the pool, so repeat the same steps for a second drive. Once you’ve done this, close Settings and boot the VM.

Adding a Storage Pool

After logging into TrueNAS, click Storage and, when prompted, click Create Pool (Figure 2).

Figure 2: The Storage Dashboard needs a pool.

In the resulting window (Figure 3), click Suggest Layout and your newly added drives will show up in the right pane. Make sure to give the pool a name and then click Create.

Figure 3: Creating a new TrueNAS pool.

The pool should be created fairly quickly. Once it’s done, you’re ready to move on.

Add a New Catalog

By default, TrueNAS has a number of pre-defined apps to install. To install a different app, you must add a new catalog, which will include a list of available containerized applications that can be installed. To do this, click Apps and then click the Manage Catalogs tab (Figure 4).

Figure 4: The Manage Catalogs tab in the Applications window of TrueNAS.

Click Add Catalog and in the resulting popout (Figure 5), add the following:

  • Catalog Name: truecharts
  • Repository: https://github.com/truecharts/catalog

Figure 5: Adding a new catalog to TrueNAS, so more applications are available for installation.

Click Save to save the information. Because there are a lot of apps in this catalog, the process will take some time to complete (between 5 and 20 minutes, depending on your network speed). When the pull completes, click the Apps tab again and you should see a considerable number of containerized apps that can be deployed (Figure 6).

Figure 6: There are now many more apps available for installation.

Locate the app you want to add and click the associated Install button. Each of these applications is containerized, so when you click Install, you’ll find some of the configuration options (Figure 7) to be quite familiar (if you’ve worked with containers before).

Figure 7: Deploying the Focalboard app with TrueNAS.

Once you’ve gone through the entire configuration, scroll to the bottom and click Save. The container will deploy, based on the chart and your configurations. Once the deployment completes, you can access the app.

You can check the status of the deployment by clicking Installed Applications, where you should see the app listed as ACTIVE (Figure 8).

Figure 8: Focalboard successfully deployed.

Click Open and a new tab will open to the app you’ve installed. In the case of Focalboard, you’ll be taken to a login screen, where you can click to register a new account.

And that’s how you can deploy containerized applications using TrueNAS. With this easy-to-use system, you can use TrueNAS as a launching pad to deploy any number of applications to your LAN, from project management tools and cloud services to development tools and more. All the while, you also have an outstanding network-attached storage solution available to you and your organization.

The post How to Work with Containers in TrueNAS appeared first on The New Stack.

]]>
From a Fan: On the Ascendance of PostgreSQL https://thenewstack.io/from-a-fan-on-the-ascendance-of-postgresql/ Fri, 03 Mar 2023 15:12:56 +0000 https://thenewstack.io/?p=22701613

The Stack Overflow 2022 developer survey has been out for a couple of weeks now, and it’s jam-packed with good

The post From a Fan: On the Ascendance of PostgreSQL appeared first on The New Stack.

]]>

The Stack Overflow 2022 developer survey has been out for a couple of weeks now, and it’s jam-packed with good information to help end users and tech vendors alike get a better handle on what software developers are into these days. The survey is based on responses from more than 70,000 participants representing a global community of developers with diverse backgrounds, interests and levels of experience. You’ll find sections exploring everything from which technologies they love and are interested in using to their favorite asynchronous collaboration tools and how much they make.

One section of the report I’m particularly excited about is a topic near and dear to my heart (and brain!): databases.

Pro Devs ♥️ Postgres

The survey polled more than 63,000 developers about which databases they most prefer. In total, MySQL won out (46.85% to 43.59% across all respondents). But, importantly, among those who self-identify as professionals (as opposed to students), PostgreSQL won 46.48% to 45.68%. Barely a win, but a win, nonetheless. What a remarkable thing. Our long nightmare is over: People are giving PostgreSQL the love it deserves. Finally!

Admittedly, I’m biased, and not only because the company I work for has been active in Postgres for decades. I think PostgreSQL is just the better tool. So, yes, I might have leaped for joy when Oracle bought and then tried to sabotage MySQL some years ago, because it gave people a good reason to take a second look at PostgreSQL.

Databases Endure

As one of the most complex pieces of the stack, databases are notoriously difficult to migrate. I can only hazard a guess that this “database durability” is the reason why MySQL won out among those who self-identify as learning to code. Twenty years ago, MySQL was everywhere. It’s the “M” in the “LAMP” stack, for Pete’s sake! It was easy to install on every operating system, and it was open source. In contrast, PostgreSQL didn’t even work on Windows until version 8.0 in 2005, a full 13 years after it debuted. It’s not hard to imagine why PostgreSQL’s uptake was so much slower, and PostgreSQL ceded a lot of ground. So while MySQL’s inertia means it’s still the first thing those learning to code reach for, I hope that will change.

PostgreSQL ♥️s Containers 

PostgreSQL is a fantastic tool, even more so in container-based architectures, and the Kubernetes operator model has made it easy to spin up, provision and manage multiple database instances in a cluster.

In today’s world, with Docker containers, it’s trivially easy to get a working PostgreSQL or MySQL instance up and running. Here’s a definition you could use for your docker-compose.yaml file:

```yaml
version: '3'
services:
  postgres:
    image: postgres:latest
    environment:
      - POSTGRES_USER=postgres      # superuser created on first start
      - PGUSER=postgres             # convenience default for psql/pg_isready inside the container
      - POSTGRES_DB=postgres        # default database (the official image expects POSTGRES_DB)
      - POSTGRES_PASSWORD=postgres
    ports:
      - "5432:5432"                 # expose Postgres on the host
```


Then just run:

```shell
docker-compose up
```


Assuming you’ve installed the psql CLI, you can connect with:

```shell
PGPASSWORD=postgres psql -U postgres -h localhost postgres
```


Easy! I can see why professional devs love PostgreSQL.

Here is another example of how easy it is using the VMware Postgres operator. After some quick setup of the Kubernetes cluster, you install the operator and then deploy a Postgres custom resource. Within a few seconds, you have a running Postgres server.

Enterprises ♥️ PostgreSQL

PostgreSQL has all the usual features: ACID compliance, scalability, data integrity, excellent security, etc. Now, I know you’re not all that impressed with this stuff today, when plenty of SQL (and some No/NewSQL!) databases purport to offer it, but it is essential.

Remember, there was a time when MySQL’s (v3 and earlier) default backend engine, MyISAM, would happily accept DDL (Data Definition Language) foreign key references meant to enforce data integrity and then quietly ignore them, letting users put in whatever they wanted. If I can’t trust my database to hold my data correctly (to literally hold on to it), then it’s hardly a database; it’s more just… a base.

Real Talk 

Why do I really love PostgreSQL? To start, because of how well it works in Spring and Java applications. So whether you’re using reactive APIs or plan to use the new Project Loom with the traditional JDBC (Java Database Connectivity) APIs, you can bet your bottom dollar PostgreSQL offers a great developer experience, performs well and most likely meets your use cases. Going a step further, PostgreSQL was one of the early databases to support MVCC (multiversion concurrency control), a design that keeps reads and writes consistent without either blocking the other.
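
A quick way to see MVCC in action is to open two connections and watch a reader stay unblocked, with a consistent view, while a writer holds uncommitted changes. Here is a minimal sketch with psycopg2, reusing the placeholder credentials from the docker-compose example above:

```python
import psycopg2  # pip install psycopg2-binary

DSN = "host=localhost dbname=postgres user=postgres password=postgres"  # placeholder credentials
writer = psycopg2.connect(DSN)
reader = psycopg2.connect(DSN)

with writer.cursor() as cur:
    cur.execute("DROP TABLE IF EXISTS accounts")
    cur.execute("CREATE TABLE accounts (id int PRIMARY KEY, balance int)")
    cur.execute("INSERT INTO accounts VALUES (1, 100)")
writer.commit()

# The writer updates the row but does not commit yet.
wcur = writer.cursor()
wcur.execute("UPDATE accounts SET balance = 50 WHERE id = 1")

# Thanks to MVCC, the reader is not blocked and still sees the old row version.
rcur = reader.cursor()
rcur.execute("SELECT balance FROM accounts WHERE id = 1")
print("reader sees:", rcur.fetchone()[0])  # 100

writer.commit()  # the new version becomes visible to subsequent snapshots
rcur.execute("SELECT balance FROM accounts WHERE id = 1")
print("reader sees:", rcur.fetchone()[0])  # 50
```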

Now, would I say it does everything you’d ever need, as well as the most expensive databases, like Oracle? Probably not. But that’s not the value proposition. It’s the fact that it does most of what I want, and the extensive ecosystem fills in the gaps.

I remember learning in 2008 about something called Compiere, an open source ERP system written using Java (and good ol’ Swing) that made extensive use of the Oracle database. Meanwhile, somebody introduced compatibility features to PostgreSQL such that this sprawling, purpose-built ERP system could be made to run on PostgreSQL. Amazing! That’s way beyond compliance with just the specs!

There are a ton of projects aimed at building on PostgreSQL and adding Oracle compatibility (including one from my own employer). And this, my friends, speaks to the heart of why I love PostgreSQL so much: It’s extensible! There are a ton of projects that change the paradigms supported by PostgreSQL:

  • Citus gives you distributed tables for PostgreSQL.
  • Greenplum, now a part of VMware, debuted in 2005 and is a big data database based on the MPP (massively parallel processing) architecture and PostgreSQL.
  • TimescaleDB is a time series database, like Netflix Atlas, Prometheus or DataDog, built on top of PostgreSQL.
  • Neon Cloud provides bottomless storage for PostgreSQL.
  • YugabyteDB, whose founders created Apache Cassandra and HBase, among other things, provides a horizontally scalable engine for PostgreSQL.
  • EnterpriseDB provides an industrial-grade distribution of PostgreSQL.
  • pgbouncer is a lightweight connection pooler for PostgreSQL.
  • pgbackrest provides trivially easy, automated backup and restore.

There are tons of extensions to PostgreSQL’s type system, too (a short pg_trgm example follows this list):

  • PostGIS provides geospatial queries and capabilities for PostgreSQL.
  • btree_gin and btree_gist provide GIN and GiST operator classes that emulate B-tree behavior for common data types.
  • citext provides a case-insensitive text type for string values.
  • cube adds multidimensional cube types.
  • dblink makes it possible to treat a remote PostgreSQL database as local.
  • The earthdistance module provides two different approaches to calculating great circle distances on the surface of the Earth.
  • file_fdw is a foreign data wrapper (of which there are many) to access data files in the server’s file system.
  • http makes it possible to issue HTTP requests from PostgreSQL.
  • isn provides data types for the following international product-numbering standards: EAN13, UPC, ISBN (books), ISMN (music), and ISSN (serials).
  • pg_cron supports running periodic jobs in PostgreSQL.
  • pg_trgm makes it possible to determine the similarity of alphanumeric text based on trigram matching and index operator classes that support fast searching for similar strings.
  • pg_audit provides detailed session and object audit logging via the standard logging facility supplied by PostgreSQL.
  • pgcrypto provides cryptographic functions for PostgreSQL. No bitcoin is required, mercifully!
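
For instance, here is a minimal sketch of enabling and querying pg_trgm from Python with psycopg2; the connection settings reuse the placeholder credentials from the docker-compose example above, and creating the extension requires appropriate privileges:

```python
import psycopg2  # pip install psycopg2-binary

conn = psycopg2.connect("host=localhost dbname=postgres user=postgres password=postgres")
conn.autocommit = True

with conn.cursor() as cur:
    # Extensions are enabled per database with a single statement.
    cur.execute("CREATE EXTENSION IF NOT EXISTS pg_trgm")

    # similarity() comes from pg_trgm: 1.0 means identical, 0.0 means no shared trigrams.
    cur.execute("SELECT similarity('postgresql', 'postgres'), similarity('postgresql', 'mysql')")
    print(cur.fetchone())
```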

And there are even extensions to the languages with which you can wield PostgreSQL’s type system! For example, PostgreSQL supports several alternatives to PL/pgSQL as languages in which you can write functions:

  • plcoffee supports writing functions in CoffeeScript (via the V8 engine).
  • pljava supports using the Java language in PostgreSQL.
  • plls supports using the LiveScript language.
  • plv8 lets you run JavaScript using the Google V8 OSS JavaScript engine (on which Node.js was initially based).

And nowadays, it is trivial to spin up a PostgreSQL instance anywhere you like, on any major cloud vendor. Indeed, every major cloud vendor offers managed, scalable PostgreSQL services.

PostgreSQL is ubiquitous, multipurpose, robust, extensible, fast, scalable and more. The Stack Overflow survey results are surprising, in my mind, only because the margin was that close. PostgreSQL is amazing! If you haven’t tried it yet, you should.

Jinali Sheth contributed to this article.

The post From a Fan: On the Ascendance of PostgreSQL appeared first on The New Stack.

]]>