
An OSS Stack for Real-Time AI: Cassandra, Pulsar and Kaskada

Developers building in-the-moment AI apps at scale require a new architecture. Open source comes to the rescue.
May 19th, 2023

In the world of academia, the hard problems posed by machine learning revolve around building better, smarter models and finding more and better data. My co-author and I know from our days as Ph.D. students that there's plenty of time to build models, and data moves slowly, if at all.

But when it comes to data engineering at data-centric enterprises, building offline models for offline data doesn’t cut it in the use cases that companies face today. Every software app with active users generates data constantly; the data moves fast, streaming back and forth between application clients and backend servers.

Use cases for machine learning (ML) in these apps need the backend infrastructure, including the ML and AI model deployments, to be able to scale and keep pace with the data. Insights that are delivered late (or never) lose a lot of their business value.

Online search, recommendation engines, user engagement interventions, video game AI and streaming content are just a few use cases that are better served if the latest, freshest data is fully used in intelligent app features. Taking full advantage of up-to-the-moment data to generate powerful insight is the central goal of real-time AI, and it requires a modern stack that can handle the demands of real-time AI at scale.

Real-Time AI Challenges

Though data speed and quantity are certainly formidable challenges, moving a successful offline ML project into a real-time AI system isn’t as simple as making your infrastructure bigger and faster. There are additional challenges, and every step needs to be automated. For example, “data wrangling” — that much-maligned, messy cleanup step in every data science project — has to be automated, as does the communication and orchestration between applications, services, ML models, streams and data stores. Development and deployment processes need to be implemented around data and ML models.

Many challenges to real-time AI stem from one of three broad root causes.

Event Data Is Heterogeneous

A major reason that real-time AI is difficult is the nature of the data itself. Event data is not only fast and often high volume, but it can also be sporadic, unreliable, unstructured and incompatible with other data and systems. Events usually must be processed, transformed and aggregated via a number of steps within a data pipeline before ML and AI services can use them.

ML Models Require Homogeneity

With few exceptions, input data for an ML model, both training and live production data, needs to be normalized to a single format. This format could be rows of inputs containing numeric feature values; it could be sentences and paragraphs of text; or it could be a time series at regular intervals. In any case, the ML model needs somewhat structured data so that, across training and live data inputs, it is comparing apples to apples, so to speak.
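To make this concrete, here is a minimal sketch of what that normalization step can look like: heterogeneous event records are collapsed into one fixed-width feature row per user before they ever reach a model. The event fields and feature names here are illustrative assumptions, not a real schema.

```python
# A minimal sketch of normalizing heterogeneous event payloads into the
# single flat format an ML model expects. Field names ("event_type",
# "cart_value", etc.) are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class FeatureRow:
    user_id: str
    event_count: int
    total_cart_value: float

def normalize(events: list[dict]) -> FeatureRow:
    """Collapse a user's raw, irregular events into one fixed-width row."""
    user_id = events[0].get("user_id", "unknown")
    count = len(events)
    # Missing fields default to 0 so every row has the same shape.
    total = sum(e.get("cart_value", 0.0) for e in events)
    return FeatureRow(user_id=user_id, event_count=count, total_cart_value=total)

# Events arrive with different shapes; the model only ever sees FeatureRow.
raw = [
    {"user_id": "u1", "event_type": "page_view"},
    {"user_id": "u1", "event_type": "add_to_cart", "cart_value": 19.99},
]
print(normalize(raw))
```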

Orchestration and Compatibility

Successful real-time AI deployments require bringing data to ML models at the right time, and in the right format, before communicating results to app clients or wherever else it might be useful. It is helpful when your data storage, transport and processing systems integrate well with one another into an infrastructure stack that works well across many use cases.

Solving the Real-Time AI Infrastructure Formula

Given the demands of deploying and running ML or AI in production at scale, it’s important to choose the right tools for the job. Back in graduate school, we didn’t need (or want) to use the most powerful data storage or streaming platforms, but at large enterprises, as well as the not-so-large, using the best data infrastructure is practically required for an app or service to be a success.

As your enterprise grows, and you graduate from easy-to-use data infrastructure to a more highly scalable, resilient stack, it can be hard to know which tools are worth the effort. Having worked alongside and learned from some amazing software and data engineers over the years, we have a lot of confidence in three open source projects that create a powerful combination to support real-time AI at scale: Apache Cassandra, Apache Pulsar and Kaskada.

Apache Cassandra: Designed for Scale

Some of the largest companies in the world — like Apple, Netflix and Uber — rely on Cassandra to power their data-intensive, AI-powered applications. Cassandra’s distributed architecture provides high availability with no single point of failure, making it a popular choice for organizations that require a scalable, fault-tolerant database. Features include:

  1. Horizontal scalability: As AI applications become more sophisticated, they require the ability to handle ever-increasing volumes of data. Cassandra’s distributed architecture is based on consistent hashing, which enables seamless horizontal scaling by evenly distributing data across nodes in the cluster. This ensures that your AI applications can handle substantial data growth without compromising performance, a crucial factor from a statistical perspective.
  2. High availability: The decentralized architecture of Cassandra provides high availability and fault tolerance, which ensures that your AI applications remain operational and responsive, even during hardware failures or network outages. This feature is especially important for real-time AI applications, as their accuracy and efficiency often rely on continuous access to data for mathematical modeling and analysis.
  3. Low latency: Cassandra’s peer-to-peer architecture and tunable consistency model enable rapid read and write operations, delivering low-latency performance essential for real-time AI applications. With Cassandra, you can ensure that your AI algorithms receive the latest data as quickly as possible, allowing for more accurate and timely mathematical computations and decision-making.
  4. Flexible data modeling: Cassandra's wide-column NoSQL data model makes it possible to store and query the complex and diverse data types common in ML and AI applications. This flexibility allows data scientists to adapt their data models as requirements evolve, without having to deal with the constraints of traditional relational databases. (A data-modeling sketch follows this list.)
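As a concrete illustration of that data-modeling point, here is a minimal sketch using the DataStax Python driver (pip install cassandra-driver) to create a time-partitioned event table and write one event. The keyspace, table and column names are illustrative assumptions.

```python
# A minimal sketch: a time-partitioned event table in Cassandra.
# Assumes a local node on 127.0.0.1; names are illustrative.
from datetime import datetime, timezone
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS realtime_ai
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")

# Partitioning by (user_id, day) spreads writes across the cluster via
# consistent hashing while keeping one user's daily events together.
session.execute("""
    CREATE TABLE IF NOT EXISTS realtime_ai.events (
        user_id text,
        day date,
        event_time timestamp,
        event_type text,
        payload text,
        PRIMARY KEY ((user_id, day), event_time)
    ) WITH CLUSTERING ORDER BY (event_time DESC)
""")

now = datetime.now(timezone.utc)
session.execute(
    "INSERT INTO realtime_ai.events (user_id, day, event_time, event_type, payload) "
    "VALUES (%s, %s, %s, %s, %s)",
    ("u1", now.date(), now, "page_view", '{"path": "/home"}'),
)
```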

Apache Pulsar: Speed and Orchestration

Designed for streaming data at scale, Pulsar provides an efficient, reliable and secure platform for processing, storing and transmitting real-time data. Pulsar’s unique architecture combines the benefits of a traditional messaging system with the scalability and durability of a log-based storage system, making it an ideal solution for ML and AI applications. Advantages include:

  1. High throughput: Pulsar's architecture is designed to support high-throughput data streaming, making it perfect for AI applications that demand real-time data ingestion and processing. With Pulsar, you can efficiently feed your AI algorithms the data they need for sophisticated statistical analysis and predictions. (A producer-and-consumer sketch follows this list.)
  2. Low latency: With its distributed architecture and built-in load balancing, Pulsar provides low-latency data streaming, ensuring that your AI applications receive the latest data quickly for real-time decision-making. This feature is especially critical for AI applications that require immediate responses, such as anomaly detection, recommendation engines or time-series forecasting, where statistical models are sensitive to the freshness of the data.
  3. Scalability: Pulsar’s architecture is designed for horizontal scalability, allowing for the easy addition or removal of nodes as data processing requirements change. This means that AI applications can seamlessly scale with data growth, ensuring the ability to handle the increasing demands of real-time AI workloads without compromising models’ statistical integrity.
  4. Multitenancy: Pulsar is built with multitenancy in mind, enabling organizations to share the same messaging infrastructure across multiple applications, teams and use cases. This simplifies resource management and reduces operational overhead, making it easier for data scientists to focus on refining their AI models and algorithms instead of managing infrastructure.
  5. Data durability and reliability: Pulsar’s architecture ensures data durability and reliability, thanks to its built-in replication and acknowledgement mechanisms. These features guarantee that data is securely stored and transmitted, preventing data loss or corruption and ensuring that AI algorithms always have access to the most accurate and up-to-date information for statistical analysis.
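To show what the streaming model looks like in practice, here is a minimal sketch using the official Python client (pip install pulsar-client). The service URL, topic and subscription names are illustrative assumptions.

```python
# A minimal producer/consumer sketch with the official Pulsar Python client.
import pulsar

client = pulsar.Client("pulsar://localhost:6650")  # assumes a local broker

# Producer: application events stream into a topic as they happen.
producer = client.create_producer("persistent://public/default/user-events")
producer.send(b'{"user_id": "u1", "event_type": "add_to_cart", "cart_value": 19.99}')

# Consumer: a downstream ML service reads the same events with low latency.
consumer = client.subscribe(
    "persistent://public/default/user-events",
    subscription_name="feature-pipeline",
)
msg = consumer.receive()
print(msg.data())
consumer.acknowledge(msg)  # ack so Pulsar won't redeliver the message

client.close()
```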

One feature that is particularly helpful for ML and AI deployments is Pulsar Functions. While more complex ML models might require heavier infrastructure for deployment, models on the simpler side, such as those from Python's scikit-learn, can be deployed natively in Pulsar via a function, making it unnecessary to deploy a separate Lambda function or other API or service to host the ML model. Simpler infrastructure configuration makes data engineers happy, as the sketch below suggests.
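For instance, a small scikit-learn model could be wrapped in a Python Pulsar Function along these lines. This is a hedged sketch: the model file name and the JSON feature layout are assumptions, not a prescribed pattern.

```python
# A minimal sketch of serving a simple scikit-learn model inside a Pulsar
# Function (Python SDK). "model.joblib" and the feature layout are
# illustrative assumptions.
import json
import joblib
from pulsar import Function

class ScoreEvents(Function):
    def __init__(self):
        # Loaded once per function instance, not once per message.
        self.model = joblib.load("model.joblib")

    def process(self, input, context):
        features = json.loads(input)  # e.g. '[0.2, 1.0, 3.5]'
        prediction = self.model.predict([features])[0]
        # The returned value is published to the function's output topic.
        return json.dumps({"prediction": float(prediction)})
```

A function like this would be registered with the pulsar-admin functions tooling and attached to input and output topics, keeping the model right next to the data stream.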

Kaskada: Calculations on Event Data for ML

As a new open source project, Kaskada is certainly the least well known of the three described here. However, Kaskada offers a unique capability: It’s an event-processing engine designed specifically for turning event data into continuous, stateful timelines that are consumable by real-time systems.

The current alternatives for continuously generating stateful features for real-time AI seem painfully unfit for anything but the simplest cases. SQL-based tools require writing the same types of time-centric JOIN over and over to get events into various aggregation buckets for ML feature calculations, and the results aren’t natively stateful or continuous, meaning you have to query again and again to try to stay close to real time. Python-based tools give you all of the flexibility to write whatever calculations you want, but you have to write all the logic yourself and then find a way to deploy it. Other event-processing engines handle events and simple logic well but don’t do the complex calculations needed for ML features (or they let you write your own in Python). Kaskada features include:

  1. A time-centric computational model: Kaskada assumes you are working with event data and computing stateful values over time for a set of entities. By default, it handles calculations in the obvious, natural way, which means it automatically JOINs on time, has native event-based windowing, won’t skip a window if it has no events in it and does all of this efficiently in both a development sense and a computational one.
  2. Continuous, declarative expressions: Kaskada's simple, composable syntax, called Fenl, allows you to declare what you want computed, rather than how it should be computed. Time-centric calculations are native and concise. Related computations can easily share inputs via pipelining syntax. (A Fenl-flavored sketch follows this list.)
  3. Native time travel: Time is central in Kaskada’s design, and Fenl makes it easy to shift values forward in time. In ML, we often need to compare prior predictions to actual outcomes for model training or testing. Making this comparison “in the future” is easy with Kaskada; the opposite — letting predictive feature values that were set at one point in time “data snoop” into outcomes from a later time — isn’t allowed. (Kaskada, by default, prevents the building of ML features that already know something about the future.)
  4. Streaming and cloud native: End-to-end incremental execution is efficient by design. Kaskada is easily deployable and highly scalable in the cloud or on your hardware. Columnar compute allows you to execute analytic queries over large historical event data sets in seconds.
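To give a flavor of that declarative, time-centric style, here is an illustrative, Fenl-flavored feature query embedded in a Python snippet. The Purchases table, its fields and the exact windowing syntax are assumptions based on Kaskada's documented style; in practice the expression would be submitted through Kaskada's client or CLI rather than the print call shown.

```python
# A Fenl-flavored sketch (illustrative, not verified syntax): continuously
# maintained spend features per user. Each feature is declared, not
# hand-orchestrated; windows and time handling are native.
fenl_query = """
{
  entity: Purchases.user_id,
  total_spend: Purchases.amount | sum(),
  daily_spend: Purchases.amount | sum(window = since(daily())),
}
"""
print(fenl_query)
```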

Cassandra + Pulsar + Kaskada

Each of these three projects is best-in-class, but why use them together? Well, if you have event data, at scale, and you want to do real-time AI, it’s becoming hard to argue for another stack. Without Pulsar, it’s hard to do a lot of things in real time at web scale. Without Kaskada, it’s difficult to process events into ML features in real time; most people are currently doing it the hard way because Kaskada wasn’t widely available until very recently. Without Cassandra (or Pulsar, for that matter), scaling can become a significant hurdle.

Cassandra and Pulsar already have large, established open source development communities. Kaskada has a small but growing community, and, given its natural fit with Cassandra and Pulsar, it has the full commitment of DataStax to deliver on the promise of being the first-choice event-processing engine for real-time AI.

Getting Started

Navigating these technologies and understanding how they can fit into your current architecture can be a challenge. To accelerate development and deployment of real-time AI solutions for your business, DataStax recently introduced Luna ML, a new support service for Kaskada Open Source. Luna ML helps organizations deploy Kaskada and operate modern, open source event processing for ML. Together with the rest of our Luna offerings, we support your entire stack with real-time AI capabilities to derive maximum value from your high-speed, high-volume data.

Learn more about how DataStax enables real-time AI.
