Database and Data Management News & Trends | The New Stack

An E-tailer’s Journey to Real-Time AI: Recommendations

The journey to implementing artificial intelligence and machine learning solutions requires solving a lot of common challenges that routinely crop up in digital systems: updating legacy systems, eliminating batch processes and using innovative technologies that are grounded in AI/ML to improve the customer experience in ways that seemed like science fiction just a few years ago.

To illustrate this evolution, let’s follow a hypothetical contractor who was hired to help implement AI/ML solutions at a big-box retailer. This is the first in a series of articles that will detail important aspects of the journey to AI/ML.

The Problem

First day at BigBoxCo on the “Infrastructure” team. After working through the obligatory human resources activities, I received my contractor badge and made my way over to my new workspace. After meeting the team, I was told that we have a meeting with the “Recommendations” team this morning. My system access isn’t quite working yet, so hopefully IT will get that squared away while we’re in the meeting.

In the meeting room, it’s just a few of us: my manager and two other engineers from my new team, and one engineer from the Recommendations team. We start off with some introductions, and then move on to discuss an issue from the week prior. Evidently, there was some kind of overnight batch failure last week, and they’re still feeling the effects of that.

It seems that the current product recommendations are driven by data collected from customer orders. Each order records new associations between the products purchased together. When customers view a product page, they see recommendations based on how often other customers bought the current product alongside different products.

The product recommendations are served to users on bigboxco.com via a microservice layer in the cloud. The microservice layer uses a local (cloud) data center deployment of Apache Cassandra to serve up the results.

How the results are collected and served, though, is a different story altogether. Essentially, the results of associations between products (purchased together) are compiled during a MapReduce job. This is the batch process that failed last week. While this batch process has never been fast, it has become slower and more brittle over time. In fact, sometimes the process takes two or even three days to run.

Improving the Experience

After the meeting, I check my computer and it looks like I can finally log in. As I’m looking around, our principal engineer (PE) comes by and introduces himself. I tell him about the meeting with the Recommendations team, and he gives me a little more of the history behind the Recommendation service.

It sounds like that batch process has been in place for about 10 years. The engineer who designed it has moved on, not many people in the organization really understand it, and nobody wants to touch it.

The other problem, I begin to explain, is that the dataset driving each recommendation is almost always a couple of days old. While this might not be a big deal in the grand scheme of things, if the recommendation data could be made to be more up to date, it would benefit the short-term promotions that marketing runs.

He nods in agreement and says that he’s definitely open to suggestions on how to improve the system.

Maybe a Graph Problem?

At the outset, this sounds to me like a graph problem. We have customers who log on to the site and buy products. Before that, when they look at a product or add it to the cart, we can show recommendations in the form of “Customers who bought X also bought Y.” The site has this today, in that the recommendations service does exactly this: It returns the top four additional products that are frequently purchased together.

But we’d need some way to “rank” the products, because the mapping of one product to every other product purchased at the same time by any of our 200 million customers is going to get big, fast. So we can rank them by the number of times they appear together in an order. A graph of this system might look something like what is shown below in Figure 1.

Figure 1. A product recommendation graph showing the relationship between customers and their purchased products.

After modeling this out and running it on our graph database with real volumes of data, I quickly realized that this isn’t going to work. The traversal from one product to nearby customers, to their products, and then computing the products that appear most often takes somewhere in the neighborhood of 10 seconds. Essentially, we’ve traded the two-day batch problem for a per-lookup traversal that puts the latency precisely where we don’t want it: in front of the customer.

But perhaps that graph model isn’t too far off from what we need to do here. In fact, the approach described above is a machine learning (ML) technique known as “collaborative filtering.” Essentially, collaborative filtering is an approach that estimates the similarity of items based on the activity of many users, and it enables us to make predictions from that data. In our case, we will be implicitly collecting cart/order data from our customer base, and we will use it to make better product recommendations to increase online sales.
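
To make the ranking idea concrete, here is a minimal, pure-Python sketch of the co-occurrence counting that sits underneath both the batch job and the streaming approach described below (the orders and product IDs here are made up for illustration):

from collections import Counter, defaultdict
from itertools import permutations

# Each order is the set of product IDs bought together (IDs are illustrative).
orders = [
    {"DSH915", "APC30", "LS534"},
    {"DSH915", "APC30"},
    {"DSH915", "RSH2112", "APC30"},
]

# co_counts["DSH915"]["APC30"] == number of orders containing both products
co_counts = defaultdict(Counter)
for order in orders:
    for a, b in permutations(order, 2):
        co_counts[a][b] += 1

# Top "customers who bought X also bought ..." candidates for one product.
print(co_counts["DSH915"].most_common(4))  # e.g. [('APC30', 3), ('LS534', 1), ('RSH2112', 1)]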

Implementation

First of all, let’s look at data collection. Adding an extra service call to the shopping “place order” function isn’t too big of a deal. In fact, it already exists; it’s just that data gets stored in a database and processed later. Make no mistake: We still want to include the batch processing. But we’ll also want to process that cart data in real time, so we can feed it right back into the online data set and use it immediately afterward.

We’ll start out by putting in an event streaming solution like Apache Pulsar. That way, all new cart activity is put on a Pulsar topic, where it is consumed and sent both to the underlying batch database and to our real-time ML model for training.
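
A minimal sketch of what the cart-side producer might look like, using the Apache Pulsar Python client (the broker URL, topic name and event fields are assumptions for illustration):

import json
import uuid

import pulsar

client = pulsar.Client("pulsar://localhost:6650")
producer = client.create_producer("cart-activity")

def publish_order(cart_id, items):
    # Called from the "place order" path with every product in the order.
    producer.send(json.dumps({"cart_id": cart_id, "items": items}).encode("utf-8"))

publish_order(
    str(uuid.uuid4()),
    [{"product_id": "DSH915", "qty": 1}, {"product_id": "APC30", "qty": 2}],
)
client.close()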

As for the latter, our Pulsar consumer will write to a Cassandra table (shown in Figure 2) designed simply to hold an entry for each product in an order. Each product then has a row for every other product from that and other orders:

CREATE TABLE order_products_mapping (
    id text,
    added_product_id text,
    cart_id uuid,
    qty int,
    PRIMARY KEY (id, added_product_id, cart_id)
) WITH CLUSTERING ORDER BY (added_product_id ASC, cart_id ASC);
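
A rough sketch of the consumer side, assuming the Pulsar Python client and the DataStax Python driver for Cassandra (the topic, keyspace and message fields are illustrative). For each product in an incoming order, it writes one row pairing that product with every other product in the same cart:

import json
import uuid
from itertools import permutations

import pulsar
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("bigbox")  # keyspace name assumed
insert = session.prepare(
    "INSERT INTO order_products_mapping (id, added_product_id, cart_id, qty) "
    "VALUES (?, ?, ?, ?)"
)

client = pulsar.Client("pulsar://localhost:6650")
consumer = client.subscribe("cart-activity", subscription_name="recs-mapper")

while True:
    msg = consumer.receive()
    order = json.loads(msg.data())
    cart_id = uuid.UUID(order["cart_id"])
    # One row per (product, other product bought in the same order).
    for a, b in permutations(order["items"], 2):
        session.execute(insert, (a["product_id"], b["product_id"], cart_id, b["qty"]))
    consumer.acknowledge(msg)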


Figure 2. Augmenting an existing batch-fed recommendation system with Apache Pulsar and Apache Cassandra.

We can then query this table for a particular product (“DSH915” in this example), like this:

SELECT added_product_id, SUM(qty)
FROM order_products_mapping
WHERE id='DSH915'
GROUP BY added_product_id;

 added_product_id | system.sum(qty)
------------------+-----------------
            APC30 |               7
           ECJ112 |               1
            LN355 |               2
            LS534 |               4
           RCE857 |               3
          RSH2112 |               5
           TSD925 |               1

(7 rows)


We can then take the top four results and put them into the product recommendations table, ready for the recommendation service to query by product_id:

SELECT * FROM product_recommendations
WHERE product_id='DSH915';

 product_id | tier | recommended_id | score
------------+------+----------------+-------
     DSH915 |    1 |          APC30 |     7
     DSH915 |    2 |        RSH2112 |     5
     DSH915 |    3 |          LS534 |     4
     DSH915 |    4 |         RCE857 |     3

(4 rows)
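
A small sketch of how those top-four results might be copied into the product_recommendations table, again using the Python driver for Cassandra (the keyspace name is assumed; the tier/score layout follows the query output above):

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("bigbox")  # keyspace name assumed

def refresh_recommendations(product_id):
    rows = session.execute(
        "SELECT added_product_id, SUM(qty) AS score FROM order_products_mapping "
        "WHERE id = %s GROUP BY added_product_id",
        (product_id,),
    )
    top4 = sorted(rows, key=lambda r: r.score, reverse=True)[:4]
    for tier, row in enumerate(top4, start=1):
        session.execute(
            "INSERT INTO product_recommendations (product_id, tier, recommended_id, score) "
            "VALUES (%s, %s, %s, %s)",
            (product_id, tier, row.added_product_id, row.score),
        )

refresh_recommendations("DSH915")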


In this way, the new recommendation data is constantly being kept up to date. Also, all of the infrastructure assets described above are located in the local data center. Therefore, the process of pulling product relationships from an order, sending them through a Pulsar topic and processing them into recommendations stored in Cassandra happens in less than a second. With this simple data model, Cassandra is capable of serving the requested recommendations in single-digit milliseconds.

Conclusions and Next Steps

We’ll want to be sure to examine how our data is being written to our Cassandra tables in the long term. This way we can get ahead of potential problems related to things like unbounded row growth and in-place updates.

We may also need to add some heuristic filters, like a “do not recommend” list. This is because there are some products that our customers will buy either once or infrequently, and recommending them will only take space away from other products that they are much more likely to buy on impulse. For example, recommending a purchase of something from our appliance division, such as a washing machine, is not likely to yield an “impulse buy.”

Another future improvement would be to implement a real-time AI/ML platform like Kaskada to handle both the product relationship streaming and to serve the recommendation data to the service directly.

Fortunately, we did come up with a way to augment the existing, sluggish batch process using Pulsar to feed the cart-add events to be processed in real time. Once we get a feel for how this system performs in the long run, we should consider shutting down the legacy batch process. The PE acknowledged that we made good progress with the new solution, and, better yet, that we have also begun to lay the groundwork to eliminate some technical debt. In the end, everyone feels good about that.

In an upcoming article, we’ll take a look at improving product promotions with vector searching.

Learn how DataStax enables real-time AI.

How Dell’s Data Science Team Benefits from Agile Practices

Agile development doesn’t work for data science… at least, not at first, said Randi Ludwig, Dell Technologies’ director of Data Science. That’s because, in part, there is an uncertainty that’s innate to data science, Ludwig told audiences at the Domino Data Lab Rev4 conference in New York on June 1.

“One of the things that breaks down for data science, in terms of agile development practices, is you don’t always know exactly where you’re going,” Ludwig said. “I haven’t even looked at that data. How am I supposed to know where do I even start with that?”

Nonetheless, Dell uses agile practices with its data science team and what Ludwig has found is that while there is a certain amount of uncertainty, it’s contained to the first part of the process where data scientists collect the data, prove there’s value and obtain sign-off from stakeholders. To manage that first part, she suggested time boxing it to three or four weeks.

“The uncertainty really only lies in the first part of this process,” she said. “What that agile looks like in the first half and then the second half of the process are different on a day-to-day basis for the team.”

After the uncertainty period, the rest of the data science process is more like software development, and agile becomes beneficial, she said.

Ludwig interwove how Dell implements agile practices in data science with the benefits the team reaps from those practices.

Benefits of Standups

First, standups should include anyone involved in a data science project, including data engineers, analysts and technical project managers, Ludwig said. Talking to each other regularly runs counter to data scientists’ tendency to work in isolation, but it puts everyone on the same page and delivers value by adding context and avoiding rework. This pays dividends in that team members can step in for one another more than they can under the “lone wolf” approach to data science.

“Doing standups gives visibility to everybody else in the story,” she said. “That lack of context goes away just by talking to each other every day, and then if you actually write down what you talk about every day, you get other amazing benefits out of it.”

The standup doesn’t necessarily need to be every day, but it should be a recurring cadence that’s short enough that the project can’t go wildly afield, she added.

Benefits of Tickets

Documenting work in tickets is another key practice: it’s easy to do, it alleviates single points of failure and, she said, tickets have the benefit of not being onerous documentation.

“Just the fact of having things written down and talking to each other every day is massively beneficial, and in my experience is not how data science teams organically develop most of the time,” she said.

In the second half of the data science process, teams can articulate more clearly what exactly they’re going to do so tickets become possible. It’s important not to be too broad when writing tickets, however. Instead, break big ideas down into bite-sized chunks of work, she advised.

“‘I’m going to do EDA (exploratory data analysis) on finance data’ is way too broad. That’s way too big of a ticket. You’ve got to break those things down into smaller pieces,” she said. “Even just getting the team to articulate some of the things you’re going to look for — you’re going to look for missing values, you’re going to look for columns that are high-quality data, you’re going to look to see if there’s any correlations between some of those columns — so that you’re not bringing in redundant features.”

It also helps inform the team about the why and how of the models being built. There can also be planning tickets that incorporate questions that need to be asked, she said.

Tickets become another form of data that can be used in year-end reviews and for the management of the team. For instance, one of Ludwig’s data scientists was able to demonstrate through tagged tickets how much time was spent on building data pipelines.

“Data scientists are not best at building data pipelines, you need data engineers for that,” Ludwig said. “This is a great tool because now I know that I need to either redistribute resources I have or go ask for more resources. I actually need more data engineers.”

Tickets can also be used to document problems encountered by the data science team. For instance, Ludwig was able to use tickets to show the database management team all the problems they were encountering with a particular database, thus justifying improvements to that database.

It can be challenging to get members to create tickets and keep them updated, she acknowledged, so she has everyone open GitHub during the standup so they can update the tickets together.

Benefits of a Prioritization Log

Tickets also allow the team to create a prioritization log, she said. That triggers a slew of benefits, such as providing the team with support when there is pushback from stakeholders about requests.

“This magical thing happens where now you have stuff written down, which means you have a prioritization backlog, you can actually go through all of the ideas and thoughts you’ve had and figure out how to prioritize the work instead of just wondering,” she said. “You actually foster much less contentious relationships with stakeholders in terms of new asks by having all of the stuff written down.”

Stakeholders will start to understand that for the team to prioritize their request, they need to do some homework, such as identifying what data should be used, what business unit will consume the output of the data and what they think it should look like.

Another benefit: It can keep data scientists from wandering down rabbit holes as they explore the data. Instead, they should bring those questions to the standup and decide as a team how to prioritize them.

”This helps you on your internal pipeline, as well as your intake with external stakeholders. Once they see that you have a list to work against, then they’re, ‘Oh, I need to actually be really specific about what I’m asking from you,’” she said.

Finally, there’s no more “wondering what the data science team is doing” and whether it will deliver benefits.

“One of the biggest concerns I’ve ever heard from leadership about data science teams is that they don’t know what your plan’s going to be, what are you going to deliver in 12 or 18 months, how many things I could learn between here that’s going to completely change whatever I tell you right now,” she said. “At least now you know that this investment has a path and a roadmap that’s going to continue to provide value for a long time.”

Benefits of Reviews and Retrospectives

“Stakeholders are just really convinced that people just disappear off into an ivory tower, and then they have no idea what are those data scientists doing,” Ludwig said.

There’s a lot of angst that can be eliminated just by talking with business stakeholders, which review sessions give you a chance to do. It’s important to take the time to make sure they understand what you’re working on, why and what you found out about it, and that you understand their business problem.

Retrospectives are also beneficial because they allow the data science team to reflect and improve.

“One of the things that I actually thought was one of the most interesting about data scientists or scientists at heart, they love to learn, they love to make things more efficient and optimize, but the number of teams that organically just decide to have retrospectives is very small, in my experience,” she said. “Having an organized framework of we’re going to sit down and periodically review what we’re doing and make sure we learn from it is an ad hoc thing that some people do or some people don’t. Just enforcing that regularly has a ton of value.”

Domino Data Lab paid for The New Stack’s travel and accommodations to attend the Rev4 conference.

Apache SeaTunnel Integrates Masses of Divergent Data Faster

The latest project to reach top-level status with the Apache Software Foundation (ASF) was designed to solve common problems in data integration. Apache SeaTunnel can ingest and synchronize massive amounts of data from disparate sources faster, greatly reducing the cost of data transfer.

“Currently, the big data ecosystem consists of various data engines, including Hadoop, Hive, Kudu, Kafka, HDFS for big data ecology, MongoDB, Redis, ClickHouse, Doris for the generalized big database ecosystem, AWS S3, Redshift, BigQuery, Snowflake in the cloud, and various data ecosystems like MySQL, PostgreSQL, IoTDB, TDEngine, Salesforce, Workday, etc.,” Debra Chen, community manager for SeaTunnel, wrote in an email message to The New Stack.

“We need a tool to connect these data sources. Apache SeaTunnel serves as a bridge to integrating these complex data sources accurately, in real-time, and with simplicity. It becomes the ‘highway’ for data flow in the big data landscape.”

The open source tool is described as an “ultra-high-performance distributed data integration platform that supports real-time synchronization of massive data.” We’re talking tens of billions of data points a day.

Efficient and Rapid Data Delivery

Begun in 2017 and originally called Waterdrop, the project was renamed in October 2021 and entered the ASF incubator in December of the same year. Created by a small group in China, SeaTunnel has since grown to more than 180 contributors around the world.

Built in Java and other languages, it consists of three main components: source connectors, transfer compute engines and sink connectors. The source connectors read data from the source end (which could be JDBC, binlog, unstructured Kafka, a Software as a Service (SaaS) API or AI data models) and transform the data into a standard format understood by SeaTunnel.

Then the transfer compute engines process and distribute the data (such as data format conversion, tokenization, etc.). Finally, the sink connector transforms the SeaTunnel data format into the format required by the target database for storage.

“Of course, there are also complex high-performance data transfer mechanisms, distributed snapshots, global checkpoints, two-phase commits, etc., to ensure efficient and rapid data delivery to the target end,” Chen said.

SeaTunnel provides a connector API that does not depend on a specific execution engine. While it uses its own SeaTunnel Engine for data synchronization by default, it also supports multiple versions of Spark and Flink. The plug-in design allows users to easily develop their own connector and integrate it into the SeaTunnel project. It currently supports more than 100 connectors.

It supports various synchronization scenarios, such as offline-full synchronization, offline-incremental synchronization, change data capture (CDC), real-time synchronization and full database synchronization.

Enterprises use a variety of technology components and must develop corresponding synchronization programs for different components to complete data integration. Existing data integration and data synchronization tools often require vast computing resources or Java Database Connectivity (JDBC) connection resources to complete real-time synchronization. SeaTunnel aims to ease these burdens, making data transfer faster, less expensive and more efficient.

New Developments in the Project

In October 2022, SeaTunnel released its major version 2.2.0, introducing the SeaTunnel Zeta engine, a computing engine built specifically for data integration, and enabling cross-engine connector support.

Last December it added support for CDC synchronization, and earlier this year added support for Flink 1.15 and Spark 3. The Zeta engine was enhanced to support CDC full-database synchronization, multi-table synchronization, schema evolution and automatic table creation.

The community also recently submitted SeaTunnel-Web, which allows users not only to use SQL-like languages for transformation but also to directly connect different data sources, using a drag-and-drop interface.

“Any open source user can easily extend their own connector for their data source, submit it to the Apache community, and enable more people to use it,” Chen said. “At the same time, you can quickly solve the data integration issues between your enterprise data sources by using connectors contributed by others.”

SeaTunnel is used in more than 1,000 enterprises, including Shopee, Oppo, Kidswant and Vipshop.

What’s Ahead for SeaTunnel?

Chen laid out these plans for the project going forward:

  • SeaTunnel will further improve the performance and stability of the Zeta engine and fulfill the previously planned features such as data definition language change synchronization, error data handling, flow rate control and multi-table synchronization.
  • SeaTunnel-Web will transition from the alpha stage to the release stage, allowing users to define and control the entire synchronization process directly from the interface.
  • Cooperation with the artificial general intelligence component will be strengthened. In addition to using ChatGPT to automatically generate connectors, the plan is to enhance the integration of vector databases and plugins for large models, enabling seamless integration of over 100 data sources.
  • The relationship with the upstream and downstream ecosystems will be enhanced, integrating and connecting with other Apache ecosystems such as Apache DolphinScheduler and Apache Airflow. Regular communication occurs through emails and issue discussions, and major progress and plans of the project and community are announced through community media channels to maintain openness and transparency.
  • After supporting Google Sheets, Feishu (Lark), and Tencent Docs, it will focus on constructing SaaS connectors, such as ChatGPT, Salesforce and Workday.

Salesforce Officially Launches Einstein AI-Based Data Cloud

Salesforce has been sprinkling its brand of Einstein AI into a bevy of its products during the past couple of years, including such popular services as CRM Cloud and Marketing Cloud. Now it’s going all-out for AI in a dedicated platform.

After introducing it at last September’s Dreamforce conference, the company on June 12 officially launched the dedicated generative AI service it calls Data Cloud — a catch-all subscription service that can be utilized by enterprise IT staff, data scientists and line-of-business people alike.

CEO and co-founder Marc Benioff, speaking to a livestream audience from New York ahead of this week’s re:Inforce conference in Anaheim, Calif., told listeners that since its soft launch to existing customers last fall, Data Cloud has become the company’s “fastest-growing cloud EVER.”

“One of the reasons why this is becoming such an important cloud for our customers is as every customer is preparing for generative AI. They must get their data together. They must organize and prepare their data. So creating a data cloud is becoming that important,” Benioff said.

Einstein Trust Layer Maintains a Safety Shield

Salesforce Data Cloud includes something called the Einstein Trust Layer, a new AI moderation and redaction service that overlays all enterprise AI functions while providing data privacy and data security, Benioff said. The Trust Layer addresses concerns about the risks of adopting generative AI by meeting enterprise data security and compliance demands while offering users the continually unfolding benefits of generative AI.

“Trust is always at the start of what we do, and it’s at the end of what we do,” he said. “We came up with our first trust model for predictive (AI) in 2016, and now with generative AI, we’re able to take the same technology, and the same idea to create what we call a GPT trust layer, which we’re going to roll out to all of our customers.

“They will have the ability to use generative AI without sacrificing their data privacy and data security. This is critical for each and every one of our customers all over the world,” Benioff said.

Einstein Trust Layer aims to prevent text-generating models from retaining sensitive data, such as customer purchase orders and phone numbers. It is positioned between an app or service and a text-generating model, detecting when a prompt might contain sensitive information and automatically removing it on the backend before it reaches the model.

Trust Layer is aimed at companies with strict compliance and governance requirements that would normally preclude them from using generative AI tools. It’s also a way for Salesforce to address concerns about the privacy risks of generative AI, which have been raised by organizations such as Amazon, Goldman Sachs and Verizon.

How the AI in Data Cloud Works

A real-life example of how AI in the Data Cloud works was offered in a demo by the French sporting goods conglomerate Rossignol, which built its reputation on high-end ski wear and apparel, snowboarding and other winter sports equipment. Due to shortened winters, it is now moving increasingly into the year-round sporting goods market, which includes mountain bikes and other products, so its product SKUs are multiplying fast.

Bringing up a Rossignol product list in a demo for the audience, a company staffer was able to populate the descriptions of dozens of products (already in the company’s storage bank) into a spreadsheet that normally would have taken a team days to research, write and edit. The demo then showed how all those product descriptions could be translated into various languages with a mere series of clicks — again saving a considerable window of time for the marketing person completing this task.

Additional Salesforce News

The company also revealed its intention to infuse Einstein AI GPT into all its services by way of a distribution called GPT for Customer 360. This will make available Einstein AI GPT so enterprises can create “trusted AI-created content across every sales, service, marketing, commerce, and IT interaction, at hyperscale,” Benioff said.

Salesforce revealed new generative AI research. Key data points include:

  • While 61% of employees use or plan to use generative AI at work, nearly 60% of those don’t know how to do so in a trusted way.
  • Some 73% of employees believe generative AI introduces new security risks, yet 83% of C-suite leaders claim they know how to use generative AI while keeping data secure, compared to only 29% of individual contributors. This shows a clear disconnect between leadership and individual contributors.

Can Companies Really Self-Host at Scale?

There’s no such thing as a free lunch, or in this case, free software. It’s a myth. Paul Vixie, vice president of security at Amazon Web Services and a key contributor to the original Domain Name System (DNS), gave a compelling presentation at Open Source Summit Europe 2022 about this topic. His presentation included a comprehensive list of “dos and don’ts” for consumers of free software. Vixie’s docket included labor-intensive, often expensive engineering work that ran the gamut from small routine upgrades to locally maintaining orphaned dependencies.

To sum up the “dos and don’ts” in one sentence: engineers must always be working, monitoring, watching and ready for action. This “ready for action” engineer must have high-level expertise so that they can handle anything that comes their way. Free software isn’t inherently bad, and it definitely works. The same exercise of identifying the hidden costs of selecting software also applies to the decision to self-host a database. Self-hosting is effective for many companies. But when is it time to let go and try the easier way?

What Is a Self-Hosted Database?

Self-hosted databases come in many forms. Locally hosted open source databases are the most obvious example. However, many commercial database products have tiered packages that include self-managed options. On-premises hosting comes with pros and cons: low security risk, the ability to work directly beside the data and complete control over the database are a few advantages. There is, of course, the problem with scaling. Self-hosting creates challenges for any business or developer team with spiky or unreliable traffic because on-demand scaling is impossible. Database engineers must always account for the highest amount of traffic with on-premises servers or otherwise risk an outage in the event of a traffic spike.

For businesses that want to self-host and scale on demand, self-hosting in the cloud is another option. This option allows businesses with spiky or less predictable traffic to scale alongside their needs. When self-hosting in the cloud, the cloud provider installs and hosts their database on a virtual machine in a traditional deployment model. When you’re hosting a commercial database in the cloud, support for cloud and the database is minimal because self-hosted always means your engineering resources helm the project. This extends to emergencies like outages and even security breaches.

The Skills Gap

There are many skilled professionals with experience managing databases at scale on-premises and in the cloud. SQL databases were the de facto database for decades. Now, with the rise of more purpose-built databases geared toward deriving maximum value from the data points they’re storing, the marketplace is shifting. Newer database types that are gaining a foothold within the community are columnar databases, search engine databases, graph databases and time series databases. Now developers familiar with these technologies can choose what they want to do with their expertise.

Time Series Data

Gradient Flow expects the global market for time series analysis software will grow at a compound annual rate of 11.5% from 2020 to 2027. Time series data is a vast category and includes any data with a timestamp. Businesses collect time series data from the physical world through items like consumer Internet of Things (IoT), industrial IoT and factory equipment. Time series data originating from online sources include observability metrics, logs, traces, security monitoring and DevOps performance monitoring. Time series data powers real-time dashboards, decision-making and statistical and machine learning models that heavily influence many artificial intelligence applications.

Bridging the Skills Gap

InfluxDB 3.0 is a purpose-built time series database that ingests, stores and analyzes all types of time series data in a single datastore, including metrics, events and traces. It’s built on top of Apache Arrow and optimized for scale and performance, which allows for real-time query responses. InfluxDB has native SQL support and open source extensibility and interoperability with data science tools.

InfluxDB Cloud Dedicated is a fully managed, single-tenant instance of InfluxDB created for customers who require privacy and customization without the challenges of self-hosting. The dedicated infrastructure is resilient and scalable with built-in, multi-tier data durability with 2x data replication. Managed services mean around-the-clock support, automated patches and version updates. A higher level of customization is also a characteristic of InfluxDB Cloud Dedicated. Customers choose the cluster tier that best matches their data and workloads for their dedicated private cloud resources. From the many customizable characteristics, increased query timeouts and in-memory caching are two.

Conclusion

It’s up to every organization to decide whether to self-manage or choose a managed database. Decision-makers and engineers must have a deep understanding of the organization’s needs, traffic flow patterns, engineering skills and resources and characteristics of the data before reaching the best decision.

To get started, check out this demo of InfluxDB Cloud Dedicated, contact our sales team or sign up for your free cloud account today.

Reducing Complexity with a Multimodel Database

“Future users of large data banks must be protected from having to know how the data is organized in the machine (the internal representation).”

With these words, E.F. Codd (known as “Ted” to his friends) began the seminal paper that begat the “relational wave” that would spend the next 50 years dominating the database landscape.

“Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed.”

When Codd wrote this paper back in 1969, data access was in its infancy: Programmers wrote code that accessed flat files or tables and followed “pointers” from a row in one file to a row in a separate file. By introducing a “model” of data that encapsulated the underlying implementation (of how data was stored and retrieved) and putting a domain-specific language (in this case, SQL) in front of that model, programmers found their interaction with the database elevated away from the physical details of the data, and instead were free to think more along the logical levels of their problem, code and application.

Whether Codd knew this or not, he was tapping into a concept known today as a “complexity budget”: the idea that developers — any organization, really — can only keep track of so much complexity within their projects or organization. When a project reaches the limits of that budget, the system starts to grow too difficult to manage and all sorts of inefficiencies and problems arise — difficulties in upgrading, tracking down bugs, adding new features, refactoring code, the works. Codd’s point, really, was simple: If too much complexity is spent navigating the data “by hand,” there is less available to manage code that captures the complexities of the domain.

Fifty years later, we find ourselves still in the same scenario — needing to think more along logical and conceptual lines rather than the physical details of data. Our projects wrestle with vastly more complex domains than ever before. And while Codd’s model of relational data has served us well for over a half-century, it’s important to understand that the problem, in many respects, is still there — different in detail than the one that Codd sought to solve, but fundamentally the same issue.

Models in Nature

In Codd’s day, data was limited in scope and nature, most of it business transactions of one form or another. Parts had suppliers; manufacturers had locations; customers had orders. Creating a system of relationships between all of these was a large part of the work required by developers.

Fifty years later, however, data has changed. Not only has the amount of data stored by a business exploded by orders of magnitude (many orders of magnitude), but the shape of the data generated is wildly more irregular than it was in Codd’s day. Or, perhaps fairer to say, we capture more data than we did 50 years ago, and that data comes in all different shapes and sizes: images, audio and video, to start, but also geolocation information, genealogy data, biometrics, and that’s just a start. And developers are expected to be able to weave all of it together into a coherent fabric and present it to end users in a meaningful way. And — oh, by the way — the big launch is next month.

For its time, Codd’s relational model provided developers with exactly that — a way to weave data together into a coherent fabric. But with the growth of and changes to the data with which we have to contend, new tactics, ones which didn’t throw away the relational model but added upon it, were necessary.

We wrought what we could using the concept of “polyglot persistence,” the idea of bringing disparate parts together into a distributed system. But as any experienced architect knows all too well, the more distinct nodes in a distributed system, the greater the complexity. And the more complexity we must spend on manually stitching together data from different nodes in the database system, the less we have to spend on the complexity of the domain.

Nature of Storage

But complexity doesn’t live just in the shape of the data we weave; it also lives in the very places we store it.

What Codd hadn’t considered, largely because it was 50 years too early, is that databases also carry with them a physical concern that has to do with the actual physical realm — the servers, the disks on which the data is stored, the network and more. For decades, an organization “owning” a database has meant a non-trivial investment into all the details around what that ownership means, including the employment of a number of people whose sole purpose is the care and feeding of those machines. These “database administrators” were responsible for machine procurement and maintenance, software upgrades and patches, backups and restorations and more — all before ever touching the relational schema itself.

Like the “physical” details of data access 50 years ago, devoting time to the physical details of the database’s existence is also a costly endeavor. Between the money and time spent doing the actual maintenance as well as the opportunity cost of it being offline and unavailable for use, keeping a non-trivial database up and running is a cost that can often grow quite sizable and requires deep amounts of ongoing training and learning for those involved.

Solutions

By this point, it should be apparent that developers need to aggressively look for ways to reduce accidental and wasteful spending of complexity. We seek this in so many ways; the programming languages we use look for ways to promote encapsulation of algorithms and data, for example, and libraries and services tuck away functionality behind APIs.

Providing a well-encapsulated data strategy in the modern era often means two things: the use of a multimodel database to bring together the differing shapes of data into a single model, and the use of a cloud database provider to significantly reduce the time spent managing the database’s operational needs. Which one you choose is obviously the subject of a different conversation — just make sure it’s one that supports all the models your data needs, in an environment that requires the smallest management necessary.

Multimodel brings all the benefits of polyglot persistence, without the disadvantages. Essentially, it does this by combining a document store (JSON documents), a key/value store and other data storage models into one database engine with a common query language and a single API for access, as the sketch below illustrates. Learn more about Couchbase’s multimodel database here, and try Couchbase for yourself today with our free trial.
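
As a rough illustration of the multimodel idea, here is a sketch using the Couchbase Python SDK; the bucket, scope and document names come from Couchbase’s travel-sample dataset, and exact import paths can differ between SDK versions:

from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions

cluster = Cluster(
    "couchbase://localhost",
    ClusterOptions(PasswordAuthenticator("Administrator", "password")),
)
bucket = cluster.bucket("travel-sample")
collection = bucket.scope("inventory").collection("airline")

# Key/value access: fetch one JSON document directly by its key.
doc = collection.get("airline_10")
print(doc.content_as[dict]["name"])

# Query access: the same documents, queried with SQL++ through the same engine.
result = cluster.query(
    "SELECT a.name, a.country FROM `travel-sample`.inventory.airline AS a LIMIT 5"
)
for row in result:
    print(row)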

Case Study: Graph Databases Help Track Ill-Gotten Assets

In the modern world, information — and money — is digital. Its flow around the world leaves a trail as it moves through onshore and offshore intermediaries, such as lawyers, accountants and banks, and is transformed into other assets, such as property, private jets and yachts.

A series of massive data leaks over the last decade has gifted the International Consortium of Investigative Journalists (ICIJ) vast tranches of digitized data, potentially allowing it to track asset flows worldwide. This data has been pulled together in the ICIJ’s Offshore Leaks database.

Much of the activity covered in these leaks is perfectly legitimate — but some of it will be tied to unlawful or corrupt behavior.

A robust document database could manage these vast amounts of data. But to really understand what is going on within the world of offshore finance, the ICIJ needed to also surface the connections between various players and their assets.

The transactions and relationships contained in the Offshore Leaks data have, in many cases, been deliberately designed to obscure the truth and make it challenging to track asset movements and ascertain who owns what.

A decade ago, tracking these relationships would have meant modeling the data using multiple joined spreadsheets. But this was painstakingly difficult, and investigators still faced the question of how to visualize these relationships for a non-technical audience.

Ultimately, the job calls for a graph database, one that can map connections and relationships, including those that are designed to remain hidden.

How Neo4j Unearths the Secrets in the ICIJ’s Data

The Offshore Leaks data is a potential treasure trove of information about how wealth and assets are diverted into the offshore financial system. But secrecy and obscurity are part of the process. The relationships between individuals, companies, assets and enablers are buried in the data.

But uncovering such hidden relationships and presenting them visually, in a way investigators, journalists and citizens can quickly grasp, is where a native graph database really comes into its own. And this is why the ICIJ uses Neo4j’s platform to analyze its vast amounts of data and reveal the links between various entities.

While traditional relational databases are all about rows and columns, graph databases are all about connections. In Neo4j’s model, data elements are stored as “nodes,” which may be connected by any number of “relationships.” Both the nodes and relationships can have “properties.”

As well as its core database, Neo4j offers a suite of tools to allow developers and data scientists to model, store and query data as a graph. It has its own query language, Cypher. There is a Python wrapper for the Neo4j graph data science library to ease integration into data science workflows. At the other end of the spectrum, there are API integrations to allow, for example, web developers to build web applications that are backed by Neo4j.

This model lends itself to a variety of data access patterns depending on the use case, according to William Lyon, a developer relations engineer at Neo4j. If a journalist simply wants to know a list of offshore companies linked to a sanctioned individual, this will involve a local graph traversal from a well-defined starting point.

At the other end of the scale, a data scientist — ICIJ has both journalists and data scientists on its team — might look to analyze the entire network or run graph algorithms such as PageRank to establish the most important nodes in the network.

The platform is particularly useful both for analyzing nested data and for being able to combine datasets and running queries across them.

“By extracting the entities and the relationships out of all these documents, and then adding them into Neo4j,” Lyon said, “you get this huge graph of how all of these people and offshore companies and assets are connected.”

A “node” would represent an individual, an offshore company or, he said, an address connected to the person or the company.

“And then on those nodes, you can store key-value pair attributes that are called properties, like the name or the passport number that are associated with the node,” Lyon said. “And then we would also add another component called a label; that is a way to group the node.”

The result, said Lyon, is “you’re able to encode those relationship types that are shown in these documents or extracted through this natural language processing in the property graph data model used to model this Offshore Leaks data.”
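
A minimal sketch of that property graph model using the Neo4j Python driver, where the labels (Officer, Entity), relationship type and property names are illustrative rather than the ICIJ’s actual schema:

from neo4j import GraphDatabase

driver = GraphDatabase.driver("neo4j://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Nodes carry a label (the grouping Lyon describes) and key/value properties.
    session.run(
        "MERGE (p:Officer {name: $name}) "
        "MERGE (c:Entity {name: $company, jurisdiction: $jurisdiction}) "
        "MERGE (p)-[:OFFICER_OF]->(c)",
        name="Jane Doe", company="Example Holdings Ltd", jurisdiction="BVI",
    )

    # The journalist's question: which offshore companies is this person linked to?
    result = session.run(
        "MATCH (p:Officer {name: $name})-[:OFFICER_OF]->(c:Entity) "
        "RETURN c.name AS company",
        name="Jane Doe",
    )
    print([record["company"] for record in result])

driver.close()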

The ICIJ then uses a visualization tool called Linkurious, which integrates with Neo4j, to enable less technical users to interrogate the graph. Most journalists are not going to be writing SQL or Cypher queries.

Tracking Connections in Messy Data

One of the big problems for the ICIJ is not just the scale of the data involved and the hidden nature of the connections between various players, but the format it arrives in.

The Swiss Leaks investigation in 2015 centered on 3.3GB of leaked data. The Paradise Papers leak in 2017 involved 13.4 million documents amounting to 1.4TB of data, spanning 19 corporate registries.

The Pandora Papers investigation, which hit in 2021, included 11.9 million files from 14 different “offshore service providers”, spanning PDFs, images, emails, spreadsheets, and audio and video files, amounting to 2.94TB.

More recently, the ICIJ has pulled together previous Russia-related investigations in its Russia Archive, which has helped spark action by authorities and regulators in the wake of Russia’s invasion of Ukraine.

But the data from which investigations spring is never handed to reporters on a tidy plate. “The data that we get is usually very problematic,” Emilia Diaz-Struck, ICIJ’s data and research editor, told The New Stack. “It’s very messy, it’s not structured.” For instance, just 4% of the data in the Pandora Papers was in structured formats, such as spreadsheets.

So the ICIJ uses a variety of tools for “entity extraction,” including optical character recognition (OCR) and machine learning. “For some of the information that was in documents, we use Python scripts that our team wrote for extracting,” she said.

The team also uses Scikit-learn, a Python machine learning toolkit, “to separate forms from longer documents and then we used OCR to extract the information.” Some investigations have included handwritten documents and this means data must be transcribed manually.
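
As a hedged illustration of that kind of document triage, here is a small scikit-learn sketch that separates “forms” from longer documents; the labels and training snippets are entirely made up and are not from the ICIJ’s pipeline:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training set: a handful of labeled snippets ("form" vs. "document").
texts = [
    "Name: ____  Date of incorporation: ____  Registered agent: ____",
    "Shareholder: ____  Jurisdiction: ____  Signature: ____",
    "Dear Sir, further to our meeting last week regarding the trust structure...",
    "This agreement is entered into by and between the parties listed below...",
]
labels = ["form", "form", "document", "document"]

classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(texts, labels)

print(classifier.predict(["Beneficial owner: ____  Passport number: ____"]))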

Once the entities have been extracted, the ICIJ must still fact-check and validate the information.

The organization has also developed its own platform, Datashare, which is an open source tool for securely sharing massive amounts of records with everyone involved in a project. Clearly, with as many as 600 journalists on an investigation, it’s not feasible for individual reporters to have to visit a single secure location.

But even when the ICIJ and its partners have extracted this vast amount of information from a mass of unstructured data, it still must use Neo4j’s graph database to connect the dots between individuals, entities and assets, and visualize the results. Or, to put it another way, build a story.

No Need for Complex Queries

The ICIJ’s ultimate aim is to further public interest journalism and democratize the data it obtains. This necessarily means putting that data in the hands of people who are not necessarily technical experts but rather seasoned field reporters, great storytellers or simply highly motivated citizens.

By combining Neo4j’s ability to uncover links and relationships that researchers may not even have dreamed of, together with Linkurious’s data visualization and analysis technology, said Diaz-Struck, it has been able to both construct the graph and provide an interface for people to query it, without the need to code or construct complex queries.

“That’s the powerful thing, the magic,” she said. “They can start typing a name of anyone, or an address, and then they will get suggestions.”

From there, she said, journalists and other researchers can expand their search and realize that a person they thought had one company has connections to multiple companies, or entities. From there, she said, they can return to Datashare and explore the documents themselves.

“This is a great way to find connections and find key information that will help advance their reporting process,” Diaz-Struck said. “It helps a lot with discovering stories and interpreting and finding connections.”

Anyone can get a feel for the power of graph databases by checking the Offshore Leaks database. Because, as Diaz-Struck said, ICIJ’s work is about transparency, and answering the question, “How do we democratize access to data and make it available and usable for everyone?”

Vector Databases: What Devs Need to Know about How They Work

When we say “database” today we are probably talking about persistent storage, relational tables, and SQL. Rows and Columns, and all that stuff. Many of the concepts were designed to pack data into what was, at the time they were created, limited hard disk space. But most of the things we store and search for are still just numbers or strings. And while dealing with strings is clearly a little more complex than dealing with numbers, we generally only need an exact match — or maybe a simply defined fuzzy pattern.

This post looks at the slightly different challenges to traditional tooling that AI brings. The journey starts with a previous attempt to emulate modern AI, by creating a Shakespeare sonnet.

We analyzed a corpus and tried predicting words, a trick played to perfection by ChatGPT. We recorded the distance words appeared from each other. And we used this distance data to guess similar words based on their distances to the same word.

So, in that experiment, if we were to have only two phrases in our corpus, then the word following “Beware” could be “the” or “of”. But why couldn’t we produce ChatGPT-level sonnets? My process was just the equivalent of a couple of dimensions of training data. There was no full model as such, and no neural network.

What we did was a somewhat limited attempt to turn words into something numerical, and thus computable. This is largely what a word embedding is. Either way, we end up with a set of numbers — aka a vector.

At school we remember vectors having magnitude and direction, so they could be used to plot an airplane’s course and speed, for example. But a vector can have any amount of numbers or dimensions attached to it:

x = (x₁, x₂, x₃, …, xₙ)

Obviously, this can no longer be placed neatly in physical space, though I welcome any n-dimensional beings who happen to be reading this post.

By reading lots of texts and comparing words, vectors can be created that will approximate characteristics like the semantic relationship of the word, definitions, context, etc. For example, reading fantasy literature I might see very similar uses of “King” and “Queen,” and could assign each word a small two-dimensional vector, say King = [5, 3], Queen = [6, 4], Man = [2, 1] and Woman = [3, 2].

The values here are arbitrary, of course. But we can start to think about doing vector maths, and understand how we can navigate with these vectors:

King - Man + Woman = Queen

[5,3] - [2,1] + [3, 2] = [6,4]


The trick is to imagine not just two dimensions, but a vector of many, many dimensions. The Word2Vec algorithm uses a neural network model to learn word associations like this from a large corpus of text. Once trained, such a model can detect similar words.

Given a large enough dataset, Word2Vec can make strong estimates about a word’s meaning based on its occurrences in the text.

Using neural network training methods, we can start to both produce more vectors and improve our model’s ability to predict the next word. The network translates the “lessons” provided by the corpus into a layer within vector space that reliably “predicts” similar examples. You can train on what target word is missing in a set of words, or you can train on what words are around a target word.
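
A minimal sketch of this using the gensim library’s Word2Vec implementation; the toy corpus here is far too small to learn anything meaningful and is only meant to show the API shape:

from gensim.models import Word2Vec

# Toy corpus: each "sentence" is a list of tokens.
sentences = [
    ["the", "king", "ruled", "the", "realm"],
    ["the", "queen", "ruled", "the", "realm"],
    ["the", "man", "walked", "to", "the", "market"],
    ["the", "woman", "walked", "to", "the", "market"],
]

# sg=1 selects the skip-gram variant: predict surrounding words from a target word.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1, epochs=200)

# Nearest neighbours in the learned vector space.
print(model.wv.most_similar("king", topn=3))

# The classic analogy: king - man + woman ≈ queen.
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))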

The common use of Shakespeare shouldn’t be seen as some form of elite validation of the Bard’s ownership of language. It is just a very large set of accurately recorded words that we all recognize as consistent English and within the context of one man’s endeavors. This matters, because whenever he says “King” or “Queen” he retains the same perspective. If he was suddenly talking about chess pieces, then the context for “Knight” would be quite different — however valid.

Any large set of data can be used to extract meaning. For example, we can look at tweets about the latest film “Spider-Man: Across the Spiderverse,” which has generally been well-reviewed, by those who would be likely to comment or see it:

“That was a beautiful movie.”

“The best animation ever, I’m sorry but it’s true, only 2 or 3 movies are equal to this work of art.”

“It really was peak.”

“..is a film made with LOVE. Every scene, every frame was made with LOVE.”

“We love this film with all our hearts.”

But you can begin to see how millennial mannerisms mixed with Gen Z expressions, while all valid, might cause some problems. The corpus needs to be sufficiently large that there are natural comparisons within the data, so that one type of voice doesn’t become an outlier.

Obviously, if you wanted to train a movie comparison site, these are the embeddings you would want to look at.

OK, so we now have an idea of what word embeddings are in terms of vectors. Let’s generalize to vector embeddings and imagine using sentences instead of single words, or pixel values to construct images. As long as we can convert data items to vectors, the same methods apply.
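To make that generalization concrete, here is a sketch that embeds whole sentences instead of single words. It assumes the sentence-transformers package and its all-MiniLM-L6-v2 model, neither of which is prescribed by anything above:

```python
# Embed whole sentences and compare them (assumes the sentence-transformers package).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

reviews = [
    "That was a beautiful movie.",
    "The best animation ever.",
    "It really was peak.",
]

embeddings = model.encode(reviews)   # one fixed-length vector per sentence
print(embeddings.shape)              # (3, 384) for this particular model

# Cosine similarity between the first two reviews.
print(util.cos_sim(embeddings[0], embeddings[1]))
```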

In summary:

  • Models help generate vector embeddings.
  • Neural networks train these models.

What a Vector Database Does

Unsurprisingly, a vector database deals with vector embeddings. We can already perceive that dealing with vectors is not going to be the same as just dealing with scalar quantities (i.e. just normal numbers that express a value or magnitude).

The queries we deal with in traditional relational tables normally match values in a given row exactly. A vector database instead interrogates the same space as the model that generated the embeddings, and the aim is usually to find similar vectors rather than exact matches. So the first step is to add the generated vector embeddings to the database.

As the results are not exact matches, there is a natural trade-off between accuracy and speed, and this is where the individual vendors make their pitch. As with traditional databases, there is also work to be done on indexing vectors for efficiency, and on post-processing to impose an order on results.

Indexing is a way to improve efficiency as well as to focus on properties that are relevant in the search, paring down large vectors. Trying to accurately represent something big with a much smaller key is a common strategy in computing; we saw this when looking at hashing.

Working out the meaning of “similar” is clearly an issue when dealing with a bunch of numbers that stand in for something else. Algorithms for this are referred to as similarity measures. Even in a simple vector, like for an airplane, you have to decide whether two planes heading in the same direction but some distance away are more or less similar to two planes close to each other but with different destinations.
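As a toy illustration of that choice, here is a sketch in plain NumPy; the “airplane” dimensions (heading plus position) are invented for this example, and the two measures can rank the same pairs differently:

```python
# Two common similarity measures applied to toy "airplane" vectors:
# (heading_x, heading_y, position_x, position_y).
import numpy as np

a = np.array([1.0, 0.0, 10.0, 10.0])   # heading east, near the origin
b = np.array([1.0, 0.0, 90.0, 90.0])   # same heading, far away
c = np.array([0.0, 1.0, 11.0, 10.0])   # different heading, right next to a

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def euclidean_distance(u, v):
    return float(np.linalg.norm(u - v))

# Cosine rates a and b as more alike (similar direction);
# Euclidean rates a and c as more alike (close together).
print(cosine_similarity(a, b), euclidean_distance(a, b))
print(cosine_similarity(a, c), euclidean_distance(a, c))
```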

Learning from Tradition

The final consideration is leveraging experience from traditional databases; there are plenty of them to learn from. For fault tolerance, vector databases can use replication or sharding, and they face the same trade-offs between strong and eventual consistency.

Common sense suggests that there will be strategic combinations of traditional vendors and niche players, so that these methods can be reliably applied to the new data that the AI explosion will be producing. So a vector database is yet another of the new and strange beasts that should become more familiar as AI continues to be exploited.

The post Vector Databases: What Devs Need to Know about How They Work appeared first on The New Stack.

Vetting an Open Source Database? 5 Green Flags to Look for https://thenewstack.io/vetting-an-open-source-database-5-green-flags-to-look-for/ Fri, 09 Jun 2023 18:25:19 +0000 https://thenewstack.io/?p=22710410


By now, the vast majority of companies (90%, according to a report from GitHub) use open source in some way. The benefits are undeniable: Open source is cost-effective, accelerates innovation, allows for faster iteration, features robust community-driven support and can be a magnet for attracting talent.

While unsupported open source is free, most companies choose to invest in some type of supported open source solution to make their implementation of this technology robust enough to operate at enterprise scale. These solutions provide a sweet spot between the challenges of managing open source oneself and the vendor lock-in associated with proprietary software.

Given open source’s massive popularity, it’s no surprise that a plethora of supported open source solutions exist, but not all open source solutions and providers are created equal. It’s important to vet your options carefully — your mission-critical applications depend on it.

Here are five green flags to look for.

1. The Solution Offers Resiliency

Nobody wants to deal with application downtime: At best, it’s inconvenient and, at worst, it cuts into revenue and can cause reputational damage to a business. So, what happens if you experience a failure in your infrastructure or data center provider? How do you minimize the impact of planned maintenance?

Open source products, and open source databases more specifically, seldom have built-in resiliency features to address this.

For this reason, resiliency capabilities are the hallmark of solid open source database solutions. Depending on a company’s recovery time objective (RTO), which can range from seconds to days, businesses should look for holistic open source database solutions that offer database high-availability/disaster recovery in the event of unexpected failure and, in some cases, go further to facilitate uninterrupted application uptime during scheduled maintenance. Backup and restore capabilities, too, are an important part of the resiliency equation, so make sure any solution you adopt supports regular backups (that are actually usable!) at appropriate intervals. Backup capabilities to look for are the ability to perform full backups, incremental backups, point-in-time recovery and selective data restoration.

2. The Solution Features Robust Security

In today’s world, where high-profile data breaches are a frequent occurrence, robust security is vital. From a database perspective, supported open source solutions should provide safeguards like encryption while data is in transit and at rest, plus value-add options such as redaction for sensitive information, like credit card data. This is especially crucial for highly regulated industries like financial services, health care and government that handle our most sensitive data.

Capabilities for enhanced auditing are also important for security, as they let organizations see who did what to a given data set, and at what point in time. Additionally, employing fine-grained role-based access control (RBAC) enables companies to establish specific parameters governing data access, ensuring that information is only visible to individuals on a need-to-know basis. These are just some of the capabilities that can denote superior, safe and secure open source database solutions.

3. Your Provider Gives Back to the Community

Organizations should be invested in giving back to the open source projects their solutions support, so keep an eye out for companies that focus on driving innovation for the greater good of the community. Giving back might include providing funding, making significant contributions to the code, or educating people about the project and furthering its message. These are all signs of a true open source partner.

The closer a company is to the open source project its solution supports, the more adept it becomes at understanding and solving its customers’ problems. This is the most effective way it can influence the direction of the project to better support customers while simultaneously driving innovation in the community.

4. It’s True, Non-Captive Open Source

There’s an important difference between offerings that are legitimate open source and those that are merely open source-compatible. “Captive” open source solutions pose as the original open source solution from which they originated, but in reality they are merely forks of the original code. This can result in compromised functionality or the inability to access features introduced in newer versions of the true open source solution, because the fork occurred before those features were introduced. “Fake” open source can feature restrictive licensing, a lack of source code availability and a non-transparent development process.

Despite this, these solutions are sometimes still marketed as open source because, technically, the code is open to inspection and contributions are possible. But when it comes down to it, the license is held by a single company, so the degree of freedom is minute compared to that of actual open source. The key is to minimize the gap between the core database and its open source origins.

Choose solutions with licenses that are approved under the Open Source Initiative (OSI), which certifies that they can be freely used, modified and shared. Signs to look for include solutions that are supported by a robust community rather than driven by a single company. Additionally, solutions that frequently release new versions and features are also indicators of a quality provider.

5. The Solution Is Flexible

The database you choose should be flexible and customizable, allowing for different deployment models, integration with other systems and support for different data types and formats. A truly flexible database service can be deployed in various models, including on-premises, cloud-based, or hybrid and multicloud deployments. It also caters to different infrastructure preferences such as bare metal, hypervisor and Kubernetes. This flexibility can extend into support for multiple data models, allowing users to work with relational, document, graph or other data models within a single service to accommodate different application requirements.

Database services with flexible pricing and billing have the added benefit of allowing users to choose the most cost-effective plan based on their usage patterns. Look for solutions that offer various pricing models, such as pay-as-you-go, subscription-based or tiered pricing to maximize value for your investment.

At the end of the day, when it comes to open source database solutions, appearances can be deceiving. It is crucial for companies to invest additional time in thoroughly evaluating these solutions to avoid getting locked into an undesirable situation. When all is said and done, the rewards of effectively harnessing the power of open source are significant. By remaining vigilant and discerning throughout the evaluation process, you can identify the most suitable solution that truly fulfills your requirements. Look for those green flags.

The post Vetting an Open Source Database? 5 Green Flags to Look for appeared first on The New Stack.

Neil deGrasse Tyson on AI Fears and Pluto’s Demotion https://thenewstack.io/neil-degrasse-tyson-on-ai-fears-and-plutos-demotion/ Fri, 09 Jun 2023 13:00:28 +0000 https://thenewstack.io/?p=22710460


The problem with artificial intelligence, said famed astrophysicist Neil deGrasse Tyson, is that people don’t realize how long they’ve been using technologies that are, essentially, AI. So they think it’s something new. But really AI is something that’s been around for a while, from Google Maps to Siri, he pointed out.

“The dire predictions for AI make very good media clickbait as, of course, the public now thinks of AI as an enemy of society without really understanding what role it has already played in society,” Tyson said. “I think once you become accustomed to something, you no longer think of it as AI. I can talk into my cell phone and say, ‘Where’s the nearest Starbucks, I want to get there before it closes and I need the shortest traffic route,’ and [it] gives you that answer in moments, and not a single human being was involved in that decision. So again, this is not a computer doing something rote. It’s a computer figuring stuff out that a human being might have done and would have taken longer. Nobody’s calling that AI — why not?”

Tyson, who directs the Hayden Planetarium, spoke last week in New York at Rev 4, a data science and analytics conference held by Domino Data Lab. Tyson pointed out that computers and AI have been doing the tasks of humans for some time now.

“Part of me sees what’s happened in recent months, where this AI power has crossed over this line in the sand and now it’s affecting people in the liberal arts world. It can now compose their term paper and they’re losing their shit over it. They’re freaking out over this,” Tyson said. “And I think ‘What do you think it’s been doing for the rest of us for the past 60 years?’ When it beat us at chess? Did you say, oh my gosh, ‘That’s the end of the world?’ No, you didn’t, you were intrigued by this. It beat us at Go, it beat us at Jeopardy. Now it can write a term paper, and you’re freaking out.”

He acknowledged that guidance is needed with AI, as it is with any powerful tool, but pointed out that he doesn’t think it’s uniquely placed to end civilization relative to other powerful tools — “We’ve created nuclear weapons that are controlled by computers,” he added.

“Yes, you put in some checks and balances, but the idea that some humanoid robot is going to come out, that’s not the direction we’re going,” he said. ”It’s a hard problem, because people fear what they don’t understand. And you have the captains of industry saying, ‘We should fear this.’ We presume they understand what they’re talking about. So my reply here is, yes, we should fear it enough to monitor our actions closely about what you’re doing with it.”

Tyson sat on the Defense Innovation Board at the Pentagon, where members discussed the role of AI in kill decisions. If there is such a thing as the ethics of war, then AI can never make that ultimate decision, so the board recommended that there must be a human in the loop, and the military adopted the recommendation.

That said, AI’s ability to create deep fakes, from voice to video, may finally break the internet, he cautioned. It will even make it hard to peddle conspiracy theories like Pizzagate, he said.

“Nobody can trust anything. Even the people who didn’t used to trust things, they can’t even trust the things that were wrong that they trusted. So that’s basically the end of the internet,” Tyson said. “People will return to having one-on-one conversations with each other and actually calling people on the phone, and the internet will just be this playground of fake things. The tombstone [will be] internet 1992 to 2024 — 32 years, it had a good run, rest in peace.”

Tyson challenged the audience with reflections on data, including a look at how bad data led to Pluto becoming — then unbecoming — a planet. It was first identified as a planet because Neptune’s orbit didn’t follow Newton’s Law, leading astrophysicists to believe there must be a Planet X out there affecting Neptune’s orbit. Astronomers found the space where Planet X should have been, and there was a small object they named Pluto. Earth’s moon has five times the mass of Pluto, and there’s no way something so small could have disrupted Neptune’s orbit, he said.

“I have hate mail from children,” Tyson said. “I was implicated in this demotion. I didn’t demote, but I was definitely an accessory. I definitely drove a getaway car on this one.”

It was a problem of bad data collected over 10 years by the US Naval Observatory, he said. Once that data was removed, Neptune “landed right on” Newton’s Law, eliminating the need for Planet X.

In a similar vein, Mercury’s orbit does not follow Newton’s Law, which led to another search for a hypothetical planet, called Vulcan (after the Roman god, not Spock‘s home planet), a search that ended only with Albert Einstein’s theory of relativity.

“1916, Albert Einstein introduces an upgrade in the laws of physics, the laws of motion and the laws of gravity, the general theory of relativity demonstrating that under strong gravitational fields, the laws of motion do not follow Newton’s law,” he said. “It’s general relativity. It’s a different physics model. Vulcan died overnight — it was unnecessary.”

Data and even the frameworks in which the data is used can be flawed, he added.

“Even if the analysis is accurate within itself, the fact that you do this analysis instead of that is what could be flawed,” he told the audience of data scientists.

Domino Data Lab paid for The New Stack’s travel and accommodations to attend the Rev4 conference.

The post Neil deGrasse Tyson on AI Fears and Pluto’s Demotion appeared first on The New Stack.

How Apache Airflow Better Manages Machine Learning Pipelines https://thenewstack.io/how-apache-airflow-better-manages-machine-learning-pipelines/ Thu, 08 Jun 2023 22:13:55 +0000 https://thenewstack.io/?p=22710470


VANCOUVER — What is apparent about Apache Airflow, the open source project widely used for building machine learning pipelines? The experience is getting even easier, as illustrated in a discussion on The New Stack Makers with three technologists from Amazon Web Services.

Apache Airflow is a Python-based platform to programmatically author, schedule and monitor workflows. It is well-suited to machine learning for building pipelines, managing data, training models, and deploying them.

Airflow is generic enough to cover the whole machine learning pipeline: it fetches data and performs extraction, transformation and loading (ETL), tags the data, runs the training, deploys the model, tests it and sends it to production.
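For a sense of what that chaining looks like, here is a minimal sketch of an Airflow DAG wiring those stages together; the task bodies are placeholders, and the DAG itself is illustrative rather than something taken from the discussion:

```python
# A minimal Airflow DAG sketch: four placeholder stages chained in order.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull raw data from the source systems")

def transform():
    print("clean and label the data")

def train():
    print("train the model")

def deploy():
    print("push the model to production")

with DAG(
    dag_id="ml_pipeline_sketch",
    start_date=datetime(2023, 6, 1),
    schedule="@daily",   # Airflow 2.4+ uses `schedule`; older releases use `schedule_interval`
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    train_task = PythonOperator(task_id="train", python_callable=train)
    deploy_task = PythonOperator(task_id="deploy", python_callable=deploy)

    # Run the stages in order: extract -> transform -> train -> deploy.
    extract_task >> transform_task >> train_task >> deploy_task
```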

In an On the Road episode of Makers recorded at the Linux Foundation’s Open Source Summit North America, our guests, who all work with the AWS Managed Service for Airflow team, reflected on the work on Apache Airflow to improve the overall experience:

Dennis Ferruzzi, a software developer at AWS, is an Airflow contributor working on AIP-49, an Airflow Improvement Proposal (AIP) that will update Airflow’s logging and metrics backend to the OpenTelemetry standard. The change will allow for more granular metrics and better visibility into Airflow environments.

Niko Oliveira, a senior software development engineer at AWS, is a committer/maintainer for Apache Airflow. He spends much of his time reviewing, approving and merging pull requests. A recent project included writing and implementing AIP-51, which modifies and updates the Executor interface in Airflow. It allows Airflow to be a more pluggable architecture, which makes it easier for users to build and write their own Airflow Executors.

Raphaël Vandon, a senior software engineer at AWS, is an Apache Airflow contributor working on performance improvements for Airflow and leveraging async capabilities in AWS Operators, the part of Airflow that allows for seamless interactions with AWS.

“The beautiful thing about Airflow, that has made it so popular is that it’s so easy,” Oliveira said. “For one, it’s Python. Python is easy to learn and pick up. And two, we have this operator ecosystem. So companies like AWS, and Google and Databricks, are all contributing these operators, which really wrap their underlying SDK.”

‘That Blueprint Exists for Everyone’

Operators are like generic building blocks. Each operator does one specific task, Ferruzzi said.

“You just chain them together in different ways,” he said. “So, for example, there’s an operator to write data to [Amazon Simple Storage Service]. And then there’s an operator that will send the data to an SQL server or something like that. And basically, the community develops and contributes to these operators so that the users, in the end, are basically saying the task I want to do is pull data from here. So I’m going to use that operator, and then I want to send the data somewhere else.

“So I’m going to go and look at, say, the Google Cloud operators and find one that fits what I want to do there. It’s cross-cloud. You can interact with so many different services and cloud providers. And it’s just growing. We’re at 2,500 contributors now, I believe. And it’s just like people find a need, and they contribute it back. And now that block, that blueprint exists for everyone.”

Airflow 2.6 has an alpha for sensors, Vandon said. Sensors are operators that wait for something to happen. There are also notifiers, which get placed at the end of the workflow. They act depending on the success (or not) of the workflow.

As Vandon said, “It’s just making things simpler for users.”

The post How Apache Airflow Better Manages Machine Learning Pipelines appeared first on The New Stack.

Generative AI: What’s Ahead for Enterprises? https://thenewstack.io/generative-ai-what-opportunities-are-ahead-for-enterprises/ Thu, 08 Jun 2023 18:13:38 +0000 https://thenewstack.io/?p=22710276


There’s been a lot of speculation and hand-wringing about what impact ChatGPT and other generative AI tools will have on employment in the tech industry, especially for developers. But what is its potential for organizations and businesses? What new opportunities lie ahead?

In this episode of The New Stack Makers podcast, Nima Negahban, CEO of Kinetica, spoke to Heather Joslyn, features editor of The New Stack, about what could come next for companies, especially when generative AI is paired with data analytics.

The conversation was sponsored by Kinetica, an analytic database.

There’s an obvious use case, Negahban told us, one that will result in a transformative “killer app”: “An Alexa for all your data in your ecosystem in real time. Where you can ask, ‘What store is performing best today?’ Or, ‘What products underperform when it’s raining?’ Things like that, that’s within the purview, in a very short order, of what we can do today.”

The result, he said, could be “a whole new level of visibility into how your enterprise is running.”

An Expectation of Efficiency

Two big challenges loom in the generative AI space, Negahban said. One, security, especially when using internal data to help train an AI model: “Is it OK to send the necessary information that you need to a large language model?”

And two, accuracy — making sure that the AI outputs aren’t riddled with hallucinations. “If my CEO is asking a question, and [the system] generates that analytic on the fly and gives them an answer, how do we make sure that it’s right? How do we make sure that the information that we’re going to give that person is correct, and it’s not going to put them down a false path?”

For developers, generative AI — including tools like GitHub Copilot — will bring a new expectation of efficiency and innovation, Negahban said.

For both devs and product managers, he said, it can spur creativity; for instance, by enabling them “to make new features that previously you wouldn’t have been able to think of.”

The Future: Orchestration and Vector Search

Much remains to be discovered about using generative AI in the enterprise. For starters, the current models are basically “text completion engines,” Negahban noted. “How do you orchestrate that in a way that can actually accomplish something? That’s a multistep process.”

In addition, organizations are just starting to grapple with how to leverage the new technology with their data. “That is part of the reason why the vector search, vector database and vector search capability world is exploding right now,” he said. “Because people want to generate embeddings, and then do embedding search.”

Kinetica’s processing engine handles ad hoc queries in a performant way, without users needing to do a lot of pre-planning, indexing or data engineering. “We’re coupling that engine with the ability to generate SQL on the fly against natural language,” he said, with OpenAI technology trained on Kinetica’s own large language model.

The idea, Negahban said, is “if you can take that killer app and marry it with an engine that can do the querying in a way that’s performant and doesn’t require whole teams of people to prepare the data, that can be exceptionally powerful for an enterprise.”

Check out the entire episode to get more insight.

The post Generative AI: What’s Ahead for Enterprises? appeared first on The New Stack.

What Is Amazon QuickSight, and How Does It Uncover Insights? https://thenewstack.io/what-is-amazon-quicksight-and-how-does-it-uncover-insights/ Thu, 08 Jun 2023 15:54:07 +0000 https://thenewstack.io/?p=22710368


Every modern business collects data, but the most effective businesses know how to analyze that information and make decisions with it. If you’re looking for a sophisticated, yet cost-effective, tool that helps you glean insights from your data, Amazon QuickSight might be the answer.

Amazon QuickSight is an analytics service that provides business intelligence (BI) and data-visualization capabilities to help you make sense of your data quickly. The National Football League (NFL), for example, uses QuickSight to power its players’ Next Gen Stats, which charts every player’s individual movement. Global mining company Rio Tinto uses it to find critical insights on risk management. Meanwhile, Siemens AG creates operational dashboards that automatically scale to its needs with QuickSight.

Let’s look at the features and benefits of QuickSight and how to get the most value from it.

What Is Amazon QuickSight?

QuickSight’s interactive dashboards and visualizations for data exploration help companies uncover actionable insights in their data. The service takes in data from sources such as databases, files or streaming data.

As one of the fully managed cloud-based tools from Amazon Web Services (AWS), QuickSight replaces traditional on-premises BI tools that require significant upfront investment in hardware. With this type of service, you also save in the long run on software licenses and IT resources required for on-premises options. QuickSight also allows users to access and analyze data in real time, an improvement over complex data modeling and extract, transform and load (ETL) processes.

Here’s a closer look at the key benefits:

Compatibility with Data Sources

Amazon QuickSight is compatible with many data sources, including on-premises databases, APIs, Internet of Things (IoT) devices, CSV files and software-as-a-service (SaaS) applications. QuickSight also works with various data storage services, including Amazon S3, and can query relational databases such as those powered by Presto.

QuickSight also supports incremental data uploads, enabling users to upload new data to an S3 bucket or file.

Powerful SPICE Engine

The service’s SPICE (Super-fast, Parallel, In-memory Calculation Engine) enables you to analyze large volumes of data quickly and efficiently. SPICE uses in-memory calculations to speed up query response times, automatically optimizes queries to minimize data processing and caches data for faster access.

With SPICE, you can perform ad hoc analysis, create data visualizations and run interactive dashboards on large datasets, all while maintaining fast query response times.

Easy to Use

QuickSight’s intuitive, drag-and-drop interface makes it easy to create and share interactive dashboards, visualizations and reports. You don’t need to be a data expert to get started with data analysis.

It also automatically formats your data for optimal visualization. You get professional-grade visuals without the need for extensive in-house design or programming skills. This leaves more time and budget for analyzing data and applying ML insights rather than needing to clean up low-quality visualizations.

QuickSight supports 10 major languages, and with free iOS and Android apps, the service is accessible to employees on the go, in the office or anywhere else.

It offers numerous pricing options, including enterprise and standard editions, as well as discounts for bulk purchases and annual subscriptions.

Available Integrations

QuickSight integrates with a range of AWS offerings, including Amazon Redshift, Amazon RDS and Amazon S3. The service also integrates with third-party services, including other popular BI tools such as Tableau Software and Microsoft Power BI.

How to Use Amazon QuickSight

Using Amazon QuickSight to analyze and visualize your data is a straightforward process that involves a few key steps. Here’s a closer look at how to get started.

Understand the Interface

Start by understanding Amazon QuickSight’s user interface. The user dashboard provides quick access to visualizations and data sets, as well as links to features such as sharing visuals or personalized settings.

The dashboard is customizable, allowing users to select different metrics or dimensions for each visualization they create. The interface is intuitive, making it easy for users to navigate and find the features they need quickly.

For example, QuickSight comes with drag-and-drop features, enabling users to easily build and customize their visuals with charts, graphs, tables and maps. It also offers prebuilt templates and visualizations for users who need professional-quality visuals but are short on coding or design skills.

Create and Refine Data Sets

Users create data sets by connecting a data source or uploading a file to the platform. Once connected, users refine their data using methods such as filtering, pivoting or joining tables together. Then, they explore that data further with visualizations.

The data preparation capabilities are intuitive, enabling users to clean and transform data quickly and easily.
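QuickSight is primarily a point-and-click experience, but the same objects can also be managed programmatically. As a small sketch, assuming the boto3 SDK with configured AWS credentials (the account ID and region below are placeholders), you can list the data sets already registered in an account:

```python
# List existing QuickSight data sets (assumes boto3 and valid AWS credentials).
import boto3

quicksight = boto3.client("quicksight", region_name="us-east-1")  # placeholder region
ACCOUNT_ID = "123456789012"                                        # placeholder account ID

response = quicksight.list_data_sets(AwsAccountId=ACCOUNT_ID)
for summary in response["DataSetSummaries"]:
    print(summary["Name"], summary["DataSetId"])
```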

Analyze Data

QuickSight provides advanced analytics capabilities, such as forecasting and anomaly detection. These features allow users to gain a deeper understanding of their data and previously hidden insights and patterns. Users can personalize their data analysis through custom dashboards, reports and emails.

QuickSight’s forecasting feature uses ML algorithms to predict future trends based on historical data, as well as to spot unexpected and anomalous data patterns. You can also ask data-related questions through the Amazon QuickSight Q offering.

Sharing Visuals

Users can share visualizations internally or through website and application embedding. You control access and permissions, such as whether other users can download or edit visuals or export data. You also can receive alerts about changes in the data.

Maximizing the Value of Amazon QuickSight

Succeeding with Amazon QuickSight is easier with a trusted partner like Mission Cloud to fill gaps in skills and expertise, and unleash the full power of your data. Discover five reasons to work with an AWS partner for data, analytics and machine learning.

The post What Is Amazon QuickSight, and How Does It Uncover Insights? appeared first on The New Stack.

How LLMs Are Transforming Enterprise Applications https://thenewstack.io/how-llms-are-transforming-enterprise-applications/ Thu, 08 Jun 2023 14:12:10 +0000 https://thenewstack.io/?p=22710316


Artificial intelligence is the most transformative paradigm shift since the internet took hold in 1994. And it’s got a lot of corporations, understandably, scrambling to infuse AI into the way they do business.

One of the most important ways this is happening is via generative AI and large language models (LLMs), and it’s far beyond asking ChatGPT to write a post about a particular topic for a corporate blog or even to help write code. In fact, LLMs are rapidly becoming an integral part of the application stack.

Building generative AI interfaces like ChatGPT — “agents” — atop a database that contains all the data necessary and that can “speak the language” of LLMs is the future (and, increasingly, the present) of mobile apps. The level of dynamic interaction, access to vast amounts of public and proprietary data, and ability to adapt to specific situations make applications built on LLMs powerful and engaging in a way that’s not been available until recently.

And the technology has quickly evolved to the extent that virtually anyone with the right database and the right APIs can build these experiences. Let’s look at what’s involved.

Generative AI Revolutionizes the Way Applications Work

When some people hear “agent” and “AI” in the same sentence, they think about the simple chatbot that appears as a pop-up window that asks how it can help when they visit an e-commerce site. But LLMs can do much more than respond with simple conversational prompts and answers pulled from an FAQ. When they have access to the right data, applications built on LLMs can drive far more advanced ways to interact with us that deliver expertly curated information that is more useful, specific, rich — and often uncannily prescient.

Here’s an example:

You want to build a deck in your backyard, so you open your home-improvement store’s mobile application and ask it to build you a shopping list. Because the application is connected to an LLM like GPT-4 and many data sources (the company’s own product catalog, store inventory, customer information and order history, along with a host of other data sources), it can easily tell you what you’ll need to complete your DIY project. But it can do much more.

If you describe the dimensions and features you want to include in your deck, the application can offer visualization tools and design aids. Because it knows your postal ZIP code, it can tell you which stores within your vicinity have the items you need in stock. It can also, based on the data in your purchase history, suggest that you might need a contractor to help you with the job — and provide contact information for professionals near you.

The application could also tell you the amount of time it will take deck stain to dry (even including the seasonal climate trends for where you live) and how long it’ll be until you can actually have that birthday party on your deck that you’ve been planning. The application could also assist with and provide information on a host of other related areas, including details on project permit requirements and the effect of the construction on your property value. Have more questions? The application can help you at every step of the way as a helpful assistant that gets you where you want to go.

Using LLMs in Your Application Is Hard, Right?

This isn’t science fiction. Many organizations, including some of the largest DataStax customers, are working on many projects that incorporate generative AI.

But these projects aren’t just the realm of big, established enterprises; they don’t require vast knowledge about machine learning or data science or ML model training. In fact, building LLM-based applications requires little more than a developer who can make a database call and an API call. Building applications that provide levels of personalized context that were unheard of until recently is a reality that can be realized by anyone who has the right database, a few lines of code and an LLM like GPT-4.

LLMs are very simple to use. They take context (often referred to as a “prompt”) and produce a response. So, building an agent starts with thinking about how to provide the right context to the LLM to get the desired response.

Broadly speaking, this context comes from three places: the user’s question, the predefined prompts created by the agent’s developer and data sourced from a database or other sources (see the diagram below).

A simple diagram of how an LLM gathers context to produce a response.

The context provided by the user is typically simply the question they input into the application. The second piece could be provided by a product manager who worked with a developer to describe the role the agent should play (for example, “you’re a helpful sales agent who is trying to help customers as they plan their projects; please include a list of relevant products in your responses”).

Finally, the third bucket of provided context includes external data pulled from your databases and other data sources that the LLM should use to construct the response. Some agent applications may make several calls to the LLM before outputting the response to the user in order to construct more detailed responses. This is what technologies such as ChatGPT Plug-ins and LangChain facilitate (more on these below).
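As a sketch of how those three sources of context come together in code, assuming the pre-1.0 openai Python package and an API key in the environment; fetch_product_context() is a hypothetical stand-in for a query against your own database:

```python
# Assemble the user question, a predefined prompt and database context into one LLM call.
# Assumes the pre-1.0 openai package; fetch_product_context() is hypothetical.
import openai

SYSTEM_PROMPT = (
    "You are a helpful sales agent who helps customers plan their projects. "
    "Include a list of relevant products in your responses."
)

def fetch_product_context(question: str) -> str:
    # Hypothetical stand-in for a real database lookup (for example, a vector search).
    return "Deck boards: in stock. Joist hangers: in stock. Exterior stain: in stock."

def answer(user_question: str) -> str:
    db_context = fetch_product_context(user_question)
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "system", "content": f"Relevant data:\n{db_context}"},
            {"role": "user", "content": user_question},
        ],
    )
    return response["choices"][0]["message"]["content"]

print(answer("What do I need to build a backyard deck?"))
```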

Giving LLMs Memory

AI agents need a source of knowledge, but that knowledge has to be understandable by an LLM. Let’s take a quick step back and think about how LLMs work. When you ask ChatGPT a question, it has very limited memory or “context window.” If you’re having an extended conversation with ChatGPT, it packs up your previous queries and the corresponding responses and sends that back to the model, but it starts to “forget” the context.

This is why connecting an agent to a database is so important to companies that want to build agent-based applications on top of LLMs. But the database has to store information in a way that an LLM understands: as vectors.

Simply put, vectors enable you to reduce a sentence, concept or image to a set of dimensions. You can take a concept or context, such as a product description, and turn it into a set of numbers across several dimensions: a vector representation. Recording those dimensions enables vector search: the ability to search on multidimensional concepts, rather than keywords.

This helps LLMs generate more accurate and contextually appropriate responses while also providing a form of long-term memory for the models. In essence, vector search is a vital bridge between LLMs and the vast knowledge bases on which they are trained. Vectors are the “language” of LLMs; vector search is a required capability of databases that provide them with context.

Consequently, a key component of being able to serve LLMs with the appropriate data is a vector database that has the throughput, scalability and reliability to handle the massive datasets required to fuel agent experiences.
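To make the idea concrete, here is a toy, brute-force version of what a vector database does at enormous scale with approximate-nearest-neighbor indexes; the item names and vectors are invented:

```python
# Brute-force "vector search": rank stored embeddings by cosine similarity to a query.
import numpy as np

stored = {
    "deck stain":      np.array([0.9, 0.1, 0.0]),
    "pressure washer": np.array([0.7, 0.2, 0.1]),
    "garden gnome":    np.array([0.0, 0.1, 0.9]),
}

query = np.array([0.8, 0.15, 0.05])   # pretend embedding of "finish my deck project"

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

ranked = sorted(stored, key=lambda name: cosine(stored[name], query), reverse=True)
print(ranked)   # most similar items first
```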

… with the Right Database

Scalability and performance are two critical factors to consider when choosing a database for any AI/ML applications. Agents require access to vast amounts of real-time data and require high-speed processing, especially when deploying agents that might be used by every customer who visits your website or uses your mobile application. The ability to scale quickly when needed is paramount to success when it comes to storing data that feeds agent applications.

Apache Cassandra is a database that’s relied on by leaders like Netflix, Uber and FedEx to drive their systems of engagement, and AI has become essential to enriching every interaction that a business serves. As engagement becomes agent-powered, Cassandra becomes essential by providing the horizontal scalability, speed and rock-solid stability that makes it a natural choice for storing the data required to power agent-based applications.

For this reason, the Cassandra community developed the critical vector search capabilities to simplify the task of building AI applications on huge datasets, and DataStax has made these capabilities easily consumable via the cloud in Astra DB, the first petascale NoSQL database that is AI-ready with vector capabilities (read more about this news here).

How It Is Being Done

There are a few routes for organizations to create agent application experiences, as we alluded to earlier. You’ll hear developers talk about frameworks like LangChain, which as the name implies, enables the development of LLM-powered agents by chaining together the inputs and outputs of multiple LLM invocations and automatically pulling in the right data from the right data sources as needed.
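As a rough sketch of that chaining idea, assuming LangChain’s 2023-era API and an OpenAI API key in the environment, a prompt template and an LLM call can be wrapped into a reusable chain:

```python
# A single-step LangChain chain (assumes the langchain package, 2023-era API,
# and an OPENAI_API_KEY environment variable). Real agent apps chain several
# of these together and pull the "products" context from a database.
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

prompt = PromptTemplate(
    input_variables=["project", "products"],
    template=(
        "A customer wants to {project}. Using only these in-stock products:\n"
        "{products}\n"
        "Write a short shopping list with brief advice."
    ),
)

chain = LLMChain(llm=OpenAI(temperature=0), prompt=prompt)

print(chain.run(project="build a backyard deck",
                products="deck boards, joist hangers, exterior screws, stain"))
```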

But the most important way to move forward with building these kinds of experiences is to tap into the most popular agent on the globe right now: ChatGPT.

ChatGPT plugins enable third-party organizations to connect to ChatGPT with add-ons that serve up information on those companies. Think about Facebook. It became the social network platform, with a huge ecosystem of organizations building games, content and news feeds that could plug into it. ChatGPT has become that kind of platform: a “super agent.”

Your developers might be working on building your organization’s own proprietary agent-based application experience using a framework like LangChain, but focusing solely on that will come with a huge opportunity cost. If they aren’t working on a ChatGPT plugin, your organization will miss out on a massive distribution opportunity to integrate context that is specific to your business into the range of possible information ChatGPT can supply or actions it can recommend to its users.

A range of companies, including Instacart, Expedia, OpenTable and Slack, have built ChatGPT plugins already; think about the competitive edge their integration with ChatGPT might create.

An Accessible Agent for Change

Building ChatGPT plug-ins will be a critical part of the AI-agent projects that businesses will look to engage in. Having the right data architecture — in particular, a vector database — makes it substantially easier to build very high-performance agent experiences that can quickly retrieve the right information to power those responses.

All applications will become AI applications. The rise of LLMs and capabilities like ChatGPT plugins is making this future much more accessible.

Want to learn more about vector search in Cassandra? Register for the June 15 webinar.

The post How LLMs Are Transforming Enterprise Applications appeared first on The New Stack.

How AlphaSense Added Generative AI to Its Existing AI Stack https://thenewstack.io/how-alphasense-added-generative-ai-to-its-existing-ai-stack/ Thu, 08 Jun 2023 13:00:17 +0000 https://thenewstack.io/?p=22710355


AlphaSense has used artificial intelligence technologies for over ten years, as a way to compete against the likes of Bloomberg in the market intelligence industry. It recently launched its first generative AI application, Smart Summaries, which provides summaries of content such as earnings calls.

We spoke to VP of Product Chris Ackerson, who previously worked at IBM on the Watson project, about how AlphaSense introduced generative AI to its AI stack — and how the company intends to expand its product into a full-on AI personal assistant.

AlphaSense started as a platform to search for information in company-issued documents, such as regulatory filings and earnings calls. From the beginning, Ackerson said, AI was leveraged to support semantic search. As deep learning became more prominent in the 2010s, AlphaSense was able to organize and classify vast amounts of content using what he termed “pre-generative AI systems.”

“And then now with generative AI, we’re sort of feeding all of that structure and organization that we’ve created into these generative models, that aren’t just good at organizing or classifying content, but now can actually generate written language for users,” Ackerson said.

AlphaSense has been using language models in production for many years and is now developing its own large language models (LLMs). Since the introduction of BERT (Bidirectional Encoder Representations from Transformers) by Google in 2018, Ackerson says that AlphaSense has been leveraging the latest open source models and fine-tuning or training them on their financial content. It then optimizes the trained models for specific tasks, such as summarization or sentiment analysis.

Smart Summaries

AlphaSense released its first generative AI product, called “Smart Summaries,” in April. I asked Ackerson how customers are using this product.

One of the main benefits customers are experiencing with Smart Summaries is the ability to track a significantly larger number of companies, he replied. For example, hedge fund analysts may now be able to monitor 20 companies in their portfolio in a much more comprehensive manner.

He said that Smart Summaries achieves this by summarizing key information from various sources, such as earnings calls, sell-side research, and expert network interviews. For each company being tracked, AlphaSense provides users with a condensed overview — which Ackerson compared to a table of contents.

AlphaSense is working on other generative AI initiatives to extend the functionality of Smart Summaries. Traditionally, he said, AlphaSense has been a pull-based experience, where users come to the product when they have a specific question or problem to solve. With generative AI, he believes they can evolve the product into more of a push-based experience. He wants AlphaSense eventually to proactively understand what is important to users, by analyzing top-end trends and automatically surfacing relevant information.

He added that it’s important to not just organize information for users, but also provide recommendations and personalization. Currently, Smart Summaries focuses on individual companies, but AlphaSense plans to extend that capability to summarizing trends across portfolios and sectors in the coming months.

Furthermore, AlphaSense is working towards enabling a more freeform chat-like interface, allowing users to interact with the system in a conversational manner. While the current system supports semantic search and the ability to ask questions, Ackerson noted, the company is investing in creating a more interactive dialogue, where users can have follow-up conversations with the system.

BloombergGPT

Needless to say, AlphaSense’s competitors are also investing heavily in generative AI. None more so than Bloomberg, which at the end of March announced BloombergGPT, an LLM for finance. Earlier in our conversation, Ackerson mentioned that AlphaSense is creating LLMs as well. I asked whether they will be something similar to what Bloomberg announced, or different.

“So we are developing multiple LLMs for various tasks in the product,” he replied. “What Bloomberg announced was one large model — it was a research paper, sort of replicating some of the GPT-3 approach to training a model.”

He added that AlphaSense hasn’t seen BloombergGPT in any products yet, so “from our perspective, it’s a research paper.”

According to Ackerson, both AlphaSense and Bloomberg are focusing on fine-tuning and training their models to understand the language and domain of finance and market intelligence. The difference, he said, is that “we’re very much focused on rolling out real products and real features built on top of those LLMs.”

For its part, Bloomberg has stated previously that it plans to integrate BloombergGPT into its terminal software.

What Developer Tools AlphaSense Uses

I noted my recent article on the use of tools like LangChain by developers, to help use LLMs in their applications. I asked Ackerson about AlphaSense’s developer stack for AI.

He first acknowledged that it is still the early days of building software tooling around LLMs. But he said that LangChain is one of the tools and frameworks that his team is evaluating for potential use. In general, the company is tracking improvements in efficiency and capabilities of open source models.

Ackerson pointed out a change in the market over the past six months. Initially, it was expected that large models from a few dominant players would prevail. However, he said now there is a consensus that smaller models can achieve comparable performance, but with the advantage of having full control of the stack.

“Not having to worry about data flowing back and forth over the internet, between APIs and things, is an enormous advantage,” he said. “And so we’re leaning heavily into leveraging what’s coming out of the open source community, and then obviously tuning and developing on top of those to support our needs.”

What about storing all the data that AlphaSense collects?

The company receives content from various sources, he explained — including sell-side research, company documents, their own expert network with transcribed interviews, and news feeds. It has hundreds of millions of documents, which it stores in a combination of document search and vector databases. He added that this hybrid approach enables support for different workflows, including document search and Q&A chat functionalities.

Data Integrity

I brought up the issue of hallucination in generative AI and asked how AlphaSense deals with this in its products.

Ackerson explained that AlphaSense takes a different approach compared to consumer-focused generative AI solutions. Firstly, it curates the content that flows into AlphaSense, ensuring that it comes from authoritative sources. Secondly, it places emphasis on auditability, by allowing users to see where an answer came from.

Lastly, AlphaSense leverages the structured data it has extracted from documents, such as key performance indicators (KPIs) from earnings calls. This structured data allows them to compare and validate the accuracy of the generative AI answers.

“So with those validation steps, we’re able to reduce that hallucination risk down to extremely minimal,” he said.

The Future: AI Personal Assistants

As for the future of generative AI in his company and industry, Ackerson said that in the short to medium term, it will lower the barriers to entry for knowledge professionals who need quick access to insights contained in complex documents. Looking further ahead, he sees generative AI evolving into intelligent assistants that help plan and organize a user’s days.

AlphaSense is already exploring the concept of providing personalized morning briefings, by summarizing relevant topics and companies of interest for a user. Eventually, he said, a user will be able to give the AI system tasks and instructions throughout the day, beyond just answering questions.

You will be able to interact with your AI assistant, Ackerson said, “much like you would a [human] assistant or an intern.”

The post How AlphaSense Added Generative AI to Its Existing AI Stack appeared first on The New Stack.

Sundeck Launches Query Engineering Platform for Snowflake https://thenewstack.io/sundeck-launches-query-engineering-platform-for-snowflake/ Thu, 08 Jun 2023 12:00:06 +0000 https://thenewstack.io/?p=22709576


Sundeck, a new company led by one of the co-founders of Dremio, recently launched a public preview of its eponymous SaaS “query engineering” platform. The platform, which Sundeck says is built for data engineers, analysts and database administrators (DBAs), will initially work with Snowflake‘s cloud data platform. Sundeck will be available free of charge during the public preview; afterward, the company says it will offer “simple” pricing, including both free and premium tiers.

Sundeck (the product) is built atop an Apache-licensed open source project called Substrait, though it offers much additional functionality and value. Sundeck (the company) has already closed a $20M seed funding round, with participation from venture capital firms Coatue, NEA and Factory.

What Does It Do?

Jacques Nadeau, formerly CTO at Dremio and one of its co-founders, briefed the New Stack and explained in depth how Sundeck query engineering works. Nadeau also described a number of Sundeck’s practical applications.

Basically, Sundeck sits between business intelligence (BI)/query tools on the one hand, and data sources (again, just Snowflake, to start) on the other. It hooks into the queries and can dynamically rewrite them. It can also hook into and rewrite query results.

Sundeck “hooks” (bottom, center) insinuate themselves in the query path between data tools (on the left) and the data source (on the right). Credit: Sundeck

One immediate benefit of the query hook approach is that it lets customers optimize the queries with better SQL than the tools might generate. By inspecting queries and looking for specific patterns, Sundeck can find inefficiencies and optimize them on-the-fly, without requiring users, or indeed BI tools, to do so themselves.

Beyond Query Optimization

More generally, though, Sundeck lets customers evaluate rules and take actions. The rules can be based on the database table(s) being queried, the user persona submitting the query or even properties of the underlying system being queried. This lets Sundeck do anything from imposing usage quotas (and thus controlling Snowflake spend) to redirecting queries to different tables or a different data warehouse, rejecting certain high-cost queries outright, reducing or reshaping a result set, or kicking off arbitrary processes.

In effect, Sundeck takes the call-and-response pipeline between client and database and turns it into an event-driven service platform, with a limitless array of triggers and automated outcomes. But that’s not to say Sundeck does this in some generic compute platform-like fashion. Instead, it’s completely contextual to databases, using Snowflake’s native API.

With that in mind, we could imagine other applications for Sundeck, including observability/telemetry analytics, sophisticated data replication schemes and even training of machine learning models, using queries and/or result sets as training or inferencing data. Data regulation compliance, data exfiltration prevention, and responsible AI processes are other interesting applications for Sundeck. Apropos of that, Sundeck says its private result path technology ensures data privacy and that its platform is already SOC 2-certified.

In the Weeds

If all of this sounds a bit geeky, that would genuinely seem to be by design. Sundeck’s purpose here was to provide a user base — that already works at a technical level — access to the query pipeline, which heretofore has largely been a black box. This user audience is already authoring sophisticated data transformation pipelines with platforms like dbt, so why not let them transform queries as well?

It’s no surprise that Sundeck is a product that lives deep in the technology stack. After all, Nadeau previously led similarly infrastructural open source projects like Apache Arrow, which provides a unified standard for storing columnar data in memory (and which Nadeau says is an important building block in Snowflake’s platform), and Apache Drill, which acts as a SQL federated query broker. The rest of the fifteen-person Sundeck team has bona fides similar to Nadeau’s, counting 10 Apache project management committee (PMC) leaders, and even co-founders of Apache projects, like Calcite and Phoenix, among its ranks.

Sunny Forecast on Deck?

If data is the lifeblood of business, then query pathways are critical arteries in a business’ operation. As such, being able to observe and intercept queries, then optimize them or automate processes in response to them, seems like common sense. If Sundeck can expand to support the full array of major cloud data warehouse and lakehouse platforms, query engineering could catch on and an ecosystem could emerge.

The post Sundeck Launches Query Engineering Platform for Snowflake appeared first on The New Stack.

DataStax Adds Vector Search to Astra DB on Google Cloud https://thenewstack.io/datastax-adds-vector-search-to-astra-db-on-google-cloud/ Wed, 07 Jun 2023 17:07:25 +0000 https://thenewstack.io/?p=22710333

With so much data piling up everywhere, loaded database nodes are becoming a serious challenge for users to search faster

The post DataStax Adds Vector Search to Astra DB on Google Cloud appeared first on The New Stack.

]]>

With so much data piling up everywhere, heavily loaded database nodes are making it a serious challenge for users to search quickly and accurately and find what they are seeking.

DataStax, which makes a real-time database cloud service built upon open source Apache Cassandra, announced today that its Database as a Service (DBaaS), Astra DB, now supports vector search. This is fast becoming an essential capability for enabling databases to provide long-term memory for AI applications using large language models (LLMs) and other AI use cases.

DataStax is working with the Google Cloud AI/ML Center of Excellence as part of the Built with Google AI program to enable Google Cloud’s generative AI offerings to improve the capabilities of customers using DataStax.

Vector search can be difficult to explain to people without a math background. It uses machine learning to convert unstructured data, such as text and images, into a numeric representation within the database called a vector. This vector representation captures the meaning and context of the data, allowing for more accurate and relevant search results. It also is able to recognize and connect similar vectors in the database within the context of the query in order to produce more accurate results.

Vector search is often used for semantic search, a type of search that looks for items that are related in meaning, rather than just those that contain the same keywords. For example, a vector search engine could be used to find songs that are similar to a user’s favorite song, even if they don’t share any of the same keywords.
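
To make this concrete, here is a minimal, illustrative Python sketch of the underlying idea. The four-dimensional vectors and their values are toy assumptions (real embedding models produce hundreds or thousands of dimensions); the point is only to show how a similarity score lets a system recommend the “closest” item even when no keywords match.

import numpy as np

def cosine_similarity(a, b):
    # Scores close to 1.0 mean the vectors point the same way (similar meaning);
    # scores near 0 mean the items are unrelated.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings -- purely illustrative values, not output from a real model
favorite_song = np.array([0.9, 0.1, 0.3, 0.7])
candidate_a = np.array([0.8, 0.2, 0.4, 0.6])   # similar mood and tempo
candidate_b = np.array([0.1, 0.9, 0.8, 0.1])   # very different

print(cosine_similarity(favorite_song, candidate_a))  # high score: good recommendation
print(cosine_similarity(favorite_song, candidate_b))  # low score: poor match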

‘Vector Search Is Magic’

“Vector search is magic because it understands what you meant vs. what you said (in a query),” DataStax CPO Ed Anuff told The New Stack. “The more complex a piece of content is, turning it into a vector becomes a much more efficient way of finding this similarity without having to try to guess which keywords are (exactly) right.

“Let’s imagine that I have a database of all of the articles you’ve written. The process of turning each one of your articles into a vector is done through an LLM (large language model), and it looks through the entirety of each article. It figures out what are the most important pieces of an article, and the vector that it produces gets to the essence of it in a concise way. For example, even though you might have used the word ‘Cassandra’ many times in an article, the LLM knows, when it transforms the article into a vector, that your article is about an open-source database – not about the Cassandra constellation or a performance artist named Cassandra,” Anuff said.

Developers create vectors with simple API calls, and they query those vectors with equally simple API calls. “But they can now put this powerful capability to work. So that’s why vectorization is such a powerful aspect of this,” Anuff said.

Some of the benefits of using vector databases include:

  • Scalability: They can scale to handle large amounts of data.
  • Flexibility: They can be used to store and manage a variety of data types, including structured, unstructured and semi-structured data.
  • Performance: They can provide high performance for queries on large datasets.

Vector search is also used for image search. In this case, the vectors represent the features of an image, such as its color, texture, and shape. This allows for more accurate and relevant image search results, such as finding images that are similar to a user-uploaded image.

DataStax is launching the new vector search tool and other new features via a NoSQL copilot — a Google Cloud Gen AI-powered chatbot that helps DataStax customers develop AI applications on Astra DB. DataStax and Google Cloud are releasing CassIO, an open source plugin to LangChain that enables Google Cloud’s Vertex AI service to combine with Cassandra for caching, vector search, and chat history retrieval.

Designed for Real-Time AI Projects

Coming on the heels of the introduction of vector search into Cassandra, the availability of this new tool in the pay-as-you-go Astra DB service is designed to enable developers to leverage the massively scalable Cassandra database for their LLM, AI assistant, and real-time generative AI projects, Anuff said.

“Vector search is a key part of the new AI stack; every developer building for AI needs to make their data easily queryable by AI agents,” Anuff said. “Astra DB is not only built for global scale and availability, but it supports the most stringent enterprise-level requirements for managing sensitive data including HIPAA, PCI, and PII regulations. It’s an ideal option for both startups and enterprises that manage sensitive user information and want to build impactful generative AI applications.”

Vector search enables developers to search by using “embeddings”; Google Cloud’s API for text embedding, for example, can represent semantic concepts as vectors to search unstructured datasets, such as text and images. Embeddings are tools that enable search in natural language across a large corpus of data, in different formats, in order to extract the most relevant pieces of data.

New Capabilities in the Tool

In addition, DataStax has partnered with Google Cloud on several new capabilities:

  • CassIO: The CassIO open source library enables the addition of Cassandra into popular generative AI SDKs such as LangChain.
  • Google Cloud BigQuery Integration: New integration enables Google Cloud users to seamlessly import and export data between Cassandra and BigQuery straight from their Google Cloud Console to create and serve ML features in real time.
  • Google Cloud DataFlow Integration: New integration pipes real-time data to and from Cassandra for serving real-time features to ML models, integrating with other analytics systems such as BigQuery, and real-time monitoring of generative AI model performance.

Goldman Sachs Research estimates that the generative AI software market could grow to $150 billion, compared to $685 billion for the global software industry.

Vector search is available today as a non-production use public preview in the serverless Astra DB cloud database. It will initially be available exclusively on Google Cloud, with availability on other public clouds to follow. Developers can get started immediately by signing up for Astra.

The post DataStax Adds Vector Search to Astra DB on Google Cloud appeared first on The New Stack.

]]>
Creating an IoT Data Pipeline Using InfluxDB and AWS https://thenewstack.io/creating-an-iot-data-pipeline-using-influxdb-and-aws/ Mon, 05 Jun 2023 17:25:11 +0000 https://thenewstack.io/?p=22710030

The Internet of Things (IoT) and operations technology (OT) spaces are awash in time series data. Sensors churn out time-stamped

The post Creating an IoT Data Pipeline Using InfluxDB and AWS appeared first on The New Stack.

]]>

The Internet of Things (IoT) and operations technology (OT) spaces are awash in time series data. Sensors churn out time-stamped data at a steady and voluminous pace, and wrangling all that data can be a challenge. But you don’t want to collect a bunch of data and let it just sit there, taking up storage space without providing any value.

So, in the spirit of gaining value from data for IoT/OT, let’s look at how to configure a sensor using InfluxDB and Amazon Web Services (AWS) to collect the type of time-stamped data generated by industrial equipment. You can adapt, expand and scale the concepts here to meet full industrial production needs. Here’s what we’ll use in the example and why.

  • InfluxDB: This time series database just released an updated version that outperforms previous versions in several key areas, such as query performance against high cardinality data, compression and data ingest. Critically for this example, InfluxDB’s cloud products are available in AWS. Having everything in the same place reduces the number of data transfers and data-latency issues in many instances.
  • Telegraf: Telegraf is an open source, plugin-based data-collection agent. Lightweight and written in Go, you can deploy Telegraf instances anywhere. With more than 300 available plugins, Telegraf can collect data from any source. You can also write custom plugins to collect data from any source that doesn’t already have a plugin.
  • AWS: Amazon Web Services has a whole range of tools and services geared toward IoT. We can tap into some of these services to streamline data processing and analysis.
  • M5stickC+: This is a simple IoT device that can detect a range of measurements, including position, acceleration metrics, gyroscope metrics, humidity, temperature and pressure. This device provides multiple data streams, which is similar to what industrial operators face with manufacturing equipment.

InfluxDB and AWS IoT in Action

The following example illustrates one of many possible data pipelines that you can scale to meet the needs of a range of industrial equipment. That may include machines on the factory floor or distributed devices in the field. Let’s start with a quick overview; then we’ll dive into the details.

The basic data flow is:

Device → AWS IoT Core → MQTT (rules/routing) → Kinesis → Telegraf → InfluxDB → Visualization

The M5stickC+ device is set up to authenticate directly with AWS. The data comes into AWS IoT Core, which is an AWS service that uses MQTT to add the data to a topic that’s specific to the device. Then there’s a rule that selects any data that goes through the topic and redirects it to Amazon Kinesis, a streaming data service.

A virtual machine running in AWS also runs two Docker containers. One contains an instance of Telegraf and the other an instance of InfluxDB. Telegraf collects the data stream from Kinesis and writes it to InfluxDB. Telegraf also uses a DynamoDB table as a checkpoint in this setup so if the container or the instance goes down, it will restart at the correct point in the data stream when the application is up again.

Once the data is in InfluxDB, we can use it to create visualizations.

OK, so that’s the basic data pipeline. Now, how do we make it work?

Device Firmware

The first step is creating a data connection from the M5 device to AWS. To accomplish this we use UI Flow, a drag-and-drop editor that’s part of the M5 stack. Check out the blocks in the image below to get a sense for what the device is collecting and how these things map to our eventual output.

We can see here that this data publishes to the MQTT topic m5sticks/MyThingGG/sensors.

AWS IoT Core Rules

With the device data publishing to an MQTT broker, next we need to subscribe to that topic in AWS IoT Core. In the topic filter field, enter m5sticks/+/sensors to ensure that all the data from the device ends up in the MQTT topic.

Next, we need to create another rule to ensure that the data in the MQTT topic goes to Kinesis. In IoT Core, you can use a SQL query to accomplish this:

SELECT *, topic(2) as thing, 'sensors' as measurement, timestamp() as timestamp FROM 'm5sticks/+/sensors'

In an industrial setting, each device should have a unique name. So, to scale this data pipeline to accommodate multiple devices, we use the + wildcard in the MQTT topic to ensure that all the data from all devices ends up in the correct place.

This query adds a timestamp to the data so that it adheres to line protocol, InfluxDB’s data model.
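
As a hedged illustration of what that means, the sketch below builds a single line protocol point in Python. The measurement name, tag and field names reuse values that appear elsewhere in this walkthrough (“sensors”, the MyThingGG thing name, temperature and humidity); the helper function itself is not part of the pipeline, just a way to show the shape of the data.

def to_line_protocol(measurement, thing, fields, timestamp):
    # Line protocol shape: <measurement>,<tag set> <field set> <timestamp>
    tag_set = f"thing={thing}"
    field_set = ",".join(f"{key}={value}" for key, value in fields.items())
    return f"{measurement},{tag_set} {field_set} {timestamp}"

# Prints: sensors,thing=MyThingGG temperature=23.5,humidity=41.2 1685985911000
print(to_line_protocol("sensors", "MyThingGG",
                       {"temperature": 23.5, "humidity": 41.2}, 1685985911000))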

Telegraf Policy

Now that data is flowing from the device to Kinesis, we need to get it into InfluxDB. To do that, we use Telegraf. The code below is the policy that determines how Telegraf interacts with Kinesis. It enables Telegraf to read from the Kinesis stream and enables read and write access to DynamoDB for checkpointing.

{
	"Version": "2012-10-17"
	"Statement": [
		{
			"Sid": "AllowReadFromKinesis",
			"Effect": "Allow",
			"Action": [
				"kinesis:GetShardIterator",
				"kinesis:GetRecords",
				"kinesis:DescribeStream"
			],
			"Resource": [
				"arn:aws:kinesis:eu-west-3:xxxxxxxxx:stream/InfluxDBStream"
			]
		},
		{
			"Sid": "AllowReadAndWriteDynamoDB",
			"Effect": "Allow",
			"Action": [
				"dynamodb:PutItem",
				"dynamodb:GetItem"
			],
			"Resource": [
				"arn:aws:kinesis:eu-west-3:xxxxxxxxx:table/influx-db-telegraf"
			]		
		}
	]
}

Telegraf Config

The following Telegraf config uses Docker container networking, the Telegraf Kinesis Consumer plugin to read the data from Kinesis and the InfluxDB v2 output plugin to write that data to InfluxDB. Note that the string fields match the values from the device firmware UI.

[agent]
debug = true

[[outputs.influxdb_v2]]

## The URLs of the InfluxDB Cluster nodes.
##
## Multiple URLs can be specified for a single cluster, only ONE of the
## urls will be written to each interval.
## urls exp: http://127.0.0.1:8086
urls = ["http://influxdb:8086"]

## Token for authentication.
token = "toto-token"

## Organization is the name of the organization you wish to write to; must exist.
organization = "aws"

## Destination bucket to write into.
bucket = "toto-bucket"


[[inputs.kinesis_consumer]]
## Amazon REGION of kinesis endpoint.
region = "eu-west-3"

## Amazon Credentials
## Credentials are loaded in the following order
## 1) Web identity provider credentials via STS if role_arn and web_identity_token_file are specified
## 2) Assumed credentials via STS if role_arn is specified
## 3) explicit credentials from 'access_key' and 'secret_key'
## 4) shared profile from 'profile'
## 5) environment variables
## 6) shared credentials file
## 7) EC2 Instance Profile

## Endpoint to make request against, the correct endpoint is automatically
## determined and this option should only be set if you wish to override the
## default.
##   ex: endpoint_url = "http://localhost:8000"
# endpoint_url = ""

## Kinesis StreamName must exist prior to starting telegraf.
streamname = "InfluxDBStream"

## Shard iterator type (only 'TRIM_HORIZON' and 'LATEST' currently supported)
# shard_iterator_type = "TRIM_HORIZON"

## Max undelivered messages
## This plugin uses tracking metrics, which ensure messages are read to
## outputs before acknowledging them to the original broker to ensure data
## is not lost. This option sets the maximum messages to read from the
## broker that have not been written by an output.
##
## This value needs to be picked with awareness of the agent's
## metric_batch_size value as well. Setting max undelivered messages too high
## can result in a constant stream of data batches to the output. While
## setting it too low may never flush the broker's messages.
# max_undelivered_messages = 1000

## Data format to consume.
## Each data format has its own unique set of configuration options, read
## more about them here:
## https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_INPUT.md
data_format = "json"

## Tag keys is an array of keys that should be added as tags.
tag_keys = [
	"thing"
]

## String fields is an array of keys that should be added as string fields.
json_string_fields = [
	"pressure",
	"xGyr",
	"yAcc",
	"batteryPower",
	"xAcc",
	"temperature",
	"zAcc",
	"zGyr",
	"y",
	"x",
	"yGry",
	"humidity"

]

## Name key is the key used as the measurement name.
json_name_key = "measurement"

## Time key is the key containing the time that should be used to create the
## metric.
json_time_key "timestamp"

## Time format is the time layout that should be used to interpret the
## json_time_key. The time must be 'unix', 'unix_ms' or a time in the 
## "reference_time".
##   ex: json_time_format = "Mon Jan 2 15:04:05 -0700 MST 2006"
##       json_time_format = "2006-01-02T15:04:05Z07:00"
##       json_time_format = "unix"
##       json_time_format = "unix_ms"
json_time_format = "unix_ms"

## Optional
## Configuration for a dynamodb checkpoint
[inputs.kinesis_consumer.checkpoint_dynamodb]
## unique name for this consumer
app_name = "default"
table_name = "influx-db-telegraf"

Docker Compose

This Docker compose file gets uploaded to an EC2 instance using SSH.

version: '3'

services:
  influxdb:
    image: influxdb:2.0.6
    volumes:
      # Mount for influxdb data directory and configuration
      - influxdbv2:/root/.influxdbv2
    ports:
      - "8086:8086"
# Use the influx cli to set up an influxdb instance
  influxdb_cli:
    links:
      - influxdb
    image: influxdb:2.0.6
    # Use these same configuration parameters in your telegraf configuration, mytelegraf.conf
    # Wait until the influxd service in the influxdb container has been fully bootstrapped before setting up the CLI
    restart: on-failure:10
    depends_on:
      - influxdb
  telegraf:
    image: telegraf:1.25-alpine
    links:
      - influxdb
    volumes:
      # Mount for telegraf config
      - ./telegraf.conf:/etc/telegraf/telegraf.conf
    depends_on:
      - influxdb_cli

volumes:
  influxdbv2:

Visualization

Once your data is in InfluxDB, you can create visualizations and dashboards using your tool of choice. InfluxDB offers a native integration with Grafana and supports SQL queries using Flight SQL-compatible tools.
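
As one possible sketch of that last step, the Python snippet below uses the influxdb-client package and a short Flux query to read the past hour of sensor data. It reuses the org, token and bucket names from the example Telegraf configuration above; in a real deployment you would substitute your own credentials, URL and query.

from influxdb_client import InfluxDBClient

# Connection details match the example Telegraf config above; replace with your own.
client = InfluxDBClient(url="http://localhost:8086", token="toto-token", org="aws")

flux = '''
from(bucket: "toto-bucket")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "sensors")
'''

for table in client.query_api().query(flux):
    for record in table.records:
        print(record.get_time(), record.get_field(), record.get_value())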

Conclusion

Building reliable data pipelines is a critical aspect of industrial operations. This data provides key information about the current state of machinery and equipment, and can train predictive models used to improve machinery operations and effectiveness. Combining leading technologies like InfluxDB and AWS provides the tools necessary to capture, store, analyze and visualize this critical process data.

The use case described here is just one way to build a data pipeline. You can update, scale and amend it to accommodate a wide range of industrial IoT/OT operations or build something completely custom using the range of IoT solutions available for InfluxDB and AWS.

The post Creating an IoT Data Pipeline Using InfluxDB and AWS appeared first on The New Stack.

]]>
Dealing with Death: Social Networks and Modes of Access https://thenewstack.io/dealing-with-death-social-networks-and-modes-of-access/ Sat, 03 Jun 2023 13:00:17 +0000 https://thenewstack.io/?p=22710015

One increasingly common problem faced by social networks is what to do about death. Getting access to an account of

The post Dealing with Death: Social Networks and Modes of Access appeared first on The New Stack.

]]>

One increasingly common problem faced by social networks is what to do about death. Getting access to an account of a deceased friend or relative usually has at least three parts, depending on the territory:

  1. Get a copy of the death certificate;
  2. Get a letter of testamentary (both tech companies and financial institutions will request that you not only prove that the person is dead but also that you have a legal right to access their accounts.)
  3. Reach out to the platform.

This is all quite unreasonable, just to put a sticky note on the avatar page explaining why the deceased user is no longer responding. Waiting for a death certificate and other processes that move at lawyerly speed just adds to the misery. Social media companies are not (and don’t want to be) secondary recorders of deaths; indeed we know that accounts regularly represent entities that were never alive in the first place.

What is really missing here, and what this article looks at, are different modes of access, as part of a fully functional platform. Designers need to create alternative and systematic access methods that help solve existing scenarios without having to hack their own systems.

The Case for Backdoors

The focus on security has produced unbalanced digital fortresses that now regard their own users’ accounts as potential risks. The term backdoor was intended to imply an alternative access route, but now simply means something to be boarded up tight at the next patch, before a security inquest. This has the unfortunate consequence of limiting the options for users.

In the early days of computing, when software was still distributed by floppy disks, people updated their applications a lot less, and alternative access to fix errors or make minor changes was quite normal. Magazines were full of cheats, hacks and hints. Some authorised, some not. Before the full suite of integrated testing became available, backdoors were often added by developers to test certain scenarios for an application. Today, we are no longer encouraged to think that we own running software at all, and that has changed how we think about accessing it.

In the example of a deceased user of a social media platform, the most straightforward solution is for a third-party legal company to hold a key in escrow. That company would then be charged with communicating with concerned humans. However, the ‘key’ would not allow a general login — it would purely be used to suspend an account, or to insert a generic account epitaph. So the third party concentrates on its role of soberly talking to friends, relatives or possibly other lawyers, while the platform can just maintain its services. (And yes, that could also mean police services could halt an account without having to negotiate with the social media company.) The agreement could be required once an account had crossed a size or time-alive threshold. From a development point of view, the special access would need to be detected, along with a confirmation that the account had indeed been suspiciously quiet.

Launching a Nuke

You may have seen the familiar dramatic film device where two people have to turn their keys to launch a nuclear missile, or open a safe. It is a trope even used by Fortnite.

The two-man rule is a real control mechanism designed to achieve a high level of security for critical operations. Access requires the presence of two or more authorised people. If we just step back a bit, it is just a multi-person access agreement. Could this be useful elsewhere?

Returning to examples on social media, I’ve seen a number of times when a friend has said something relatively innocent on Twitter, stepped on a plane, only to turn his network back on to discover a tweet that has become controversial. What if his friends could temporarily hide the tweet? Like the missile launch, it would need two or more trusted users to act together. Again, the point here is to envision alternative access methods that could be coded against. Given that the idea is to help the user while they are temporarily incapacitated, the user can immediately flip any action simply by logging back on.

The only extra required concept here is the definition of a set of trusted friendly accounts, any of whom the user may feel “has their back.” In real life this is pretty normal, even though we still envision social media accounts as existing in a different time and space. In fact, you might imagine that a user who can’t trust any other accounts probably isn’t suitable to be on social media.

Implementing this concept would require defining a time period after which a friendly intervention could be considered, and a way to check that the required quorum triggered the intervention at roughly the same time. One imagines that once you become a designated friend of another user account, the option to signal concern would appear somewhere in the settings of their app. This is certainly a more complex set of things to check than standard access, and it could well produce its own problems in time.
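
As a rough sketch of the logic involved, a quorum check might look something like the Python below. Every name and threshold here is hypothetical, not any platform’s actual API; it simply shows the two conditions discussed above: the account has been quiet for a while, and enough trusted friends signaled concern at roughly the same time.

from datetime import datetime, timedelta, timezone

# Hypothetical thresholds -- purely illustrative
QUORUM = 2                           # trusted friends who must act together
WINDOW = timedelta(minutes=30)       # how close together their signals must be
QUIET_PERIOD = timedelta(hours=12)   # how long the account must have been silent first

def intervention_allowed(last_activity, friend_signals, now=None):
    """Allow a friendly intervention only if the owner has been quiet long enough
    and a quorum of trusted friends signaled within the same window."""
    now = now or datetime.now(timezone.utc)
    if now - last_activity < QUIET_PERIOD:
        return False                 # the owner is active, so they stay in control
    recent = [t for t in friend_signals if now - t <= WINDOW]
    return len(recent) >= QUORUM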

Both using a third party escrow key, or relying on a group of friendly accounts defines a three-way trust system that should be a familiar way to distribute responsibility. This is how a bank, a merchant and a buyer complete a purchase transaction. Testing these systems is similar in nature. First acknowledge the identity of the parties, then confirm that they have permission to perform the action, and finally confirm the action is appropriate at the time.

Negative Intervention

A natural variation on a third-party intervention where the authorised user is incapacitated is one where a third party wants to stop an account because they think it has been hacked or stolen. The obvious difference here is that the current user cannot be allowed to simply cancel the action. Social media companies may close a suspicious account down eventually, but there doesn’t seem to be a systematic way for users to do this independently.

This is a harder scenario to implement, as it needs a way for the authentic user to resolve the situation one way or another. Social media companies do, of course, keep alternative contact details for their users. Hence the user could signal that all is well; the account really has been taken; or the account was taken but has now been recovered. But until that happens, the account is in a slightly strange state — under suspicion, yet not officially so. Should the account be trusted? Perhaps the friends themselves are not themselves?

Get Back In

If you feel the examples above are odd, you shouldn’t. They are really just extensions of what happens when, in real life, you lock yourself out of your home and fetch a spare key from your neighbour — or ask the police not to arrest you when you smash your own window to get back in. While platforms need to regard their users with less suspicion and provide more access schemes, developers also need to experiment with innovative access styles. (Actual security breaches are often caused by disgruntled staff selling sensitive data.)

There is no question that AI could help make certain assessments — the things that have been mentioned throughout this article. Is an account acting suspiciously? Has it been quiet longer than usual? Has a two-man rule been activated? Orchestration of edge case scenarios is something that AI might well be successful with, as well.

Maybe with the help of GPT and more experimentation, users may find that recovery from uncommon but unfortunate scenarios will be less fraught in the future.

The post Dealing with Death: Social Networks and Modes of Access appeared first on The New Stack.

]]>
Bringing AI to the Data Center  https://thenewstack.io/bringing-ai-to-the-data-center/ Thu, 01 Jun 2023 13:13:19 +0000 https://thenewstack.io/?p=22709619

With all the assumptions we make about the advancements in enterprise data and cloud technologies, there’s a plain fact that

The post Bringing AI to the Data Center  appeared first on The New Stack.

]]>

With all the assumptions we make about the advancements in enterprise data and cloud technologies, there’s a plain fact that often gets overlooked: The majority of the most important enterprise data remains in the corporate data center.

There are plenty of reasons for this — some reasonable, some not so much. In some cases, it’s because of the highly sensitive nature of data, whether it’s HIPAA compliance, sensitive banking data or other privacy concerns. In other cases, the data resides in systems (think legacy enterprise resource-planning data or petabyte-scale scientific research data) that are difficult to move to the cloud. And sometimes it’s just inertia. It’s not a great excuse, but it happens all the time.

Whatever the reason, housing data on racks of corporate servers has proved to be a real hindrance to many enterprises’ ability to take advantage of AI to transform their business, because it’s been all but impossible to provide the significant compute power necessary to drive AI on the infrastructure underpinning most data centers.

But there’s a movement under way, via a small constellation of startups and big device makers, to optimize machine learning models and make AI available to companies whose data isn’t in the cloud. It’s going to be a game changer.

The Processing-Power Problem

The graphical processing unit, or GPU, was developed to handle high-intensity video-processing applications like those required by modern video games and high-resolution movies. But the ability of these processors to break down complex tasks into smaller tasks and execute them in parallel also makes these high-powered application-specific integrated circuits (ASICs) very useful for artificial intelligence. AI, after all, requires massive streams of data to refine and train machine learning models.

CPUs, on the other hand, are the flexible brains of servers, and, as such, they are built to handle a wide variety of operations, like accessing hard-drive data or moving data from cache to storage, but they lack the ability to do these tasks in parallel (multicore processors can handle parallel tasks, but not at the level of GPUs). They simply aren’t built to handle the kind of high-throughput workloads that AI demands.

High-performance GPUs are very expensive and until recently, they’ve been scarce, thanks to the crypto miners’ reliance on these high-performance chips. For the most part, they’re the realm of the cloud providers.

Indeed, high-performance computing services are a big reason companies move their data to the cloud. Google’s Tensor Processing Unit, or TPU, is a custom ASIC developed solely to accelerate machine learning workloads. Amazon also has its own chips for powering AI/ML workloads.

Optimizing for AI

GPUs have been the foundation of the rush of AI innovation that has recently taken over the headlines. Much of these high-profile developments have been driven by companies pushing the envelope on what’s possible without thinking too much about efficiency or optimization. Consequently, the workloads produced by new AI tools have been massive, and so, by necessity, managed in the cloud.

But in the past six months or so, that’s been changing. For one thing, the sprawling ML models that drive all of these cutting-edge AI tools are getting condensed significantly, but they are still generating the same powerful results.

For example, I installed the Vicuna app on my mobile phone. It’s a 13-billion-parameter model that does ChatGPT-like execution and runs in real time, right on my phone. It’s not in the cloud at all — it’s an app that resides on a device.

The Vicuna project emerged from the Large Model Systems Organization, a collaboration between the University of California, Berkeley, the University of California, Davis and Carnegie Mellon University that seeks to “make large models accessible to everyone by co-development of open datasets, models, systems and evaluation tools.”

It’s a mission that big tech isn’t ignoring. Apple’s latest desktops and iPhones have specialized processing capabilities that accelerate ML processes. Google and Apple are doing a lot of work to optimize their software for ML too.

There’s also a ton of talented engineers at startups that are working to make hardware more performant in a way that makes AI/ML more accessible.

A great example is ThirdAI, which offers a software-based engine that can train large deep-learning models by using CPUs. DataStax has been experimenting with the ThirdAI team for months and has been impressed with what they have developed — so much so that last week we announced a partnership with the company to make sophisticated large language models (LLMs) and other AI technologies accessible to any organization, regardless of where their data resides. (Read more about the partnership news here.)

Bring AI to the Data

Because of all this hard work and innovation, AI will no longer be available exclusively to organizations with data in the cloud. This is extremely important to privacy, which is a big reason many organizations keep their data on their own servers.

With the AI transformation wave that has washed over everything in the past 18 months or so, it’s all about data. Indeed, there is no AI without data, wherever it might reside. Efforts by teams like ThirdAI also enable all organizations to “bring AI to the data.”

For a long time, companies have been forced to do the opposite: bring their data to AI. They had to dedicate massive resources, time and budget to migrate data from data warehouses and data lakes to dedicated machine learning platforms before analyzing for key insights.

This results in significant data transfer costs, and the time required to migrate and analyze the data affects how quickly organizations can learn new patterns and take action with customers in the moment.

Bringing AI to the data is something we have focused on a lot at DataStax with our real-time AI efforts, because it’s the fastest way to take action based on ML/AI, delight customers and drive revenue. Bringing AI to the data center, and not just the cloud, is another important step to making the transformational AI technology wave something all companies can be a part of.

Learn about the new DataStax AI Partner Program, which connects enterprises with groundbreaking AI startups to accelerate the development and deployment of AI applications for customers.

The post Bringing AI to the Data Center  appeared first on The New Stack.

]]>
8 Real-Time Data Best Practices https://thenewstack.io/8-real-time-data-best-practices/ Thu, 01 Jun 2023 12:00:14 +0000 https://thenewstack.io/?p=22709548

More than 300 million terabytes of data are created every day. The next step in unlocking the value from all

The post 8 Real-Time Data Best Practices appeared first on The New Stack.

]]>

More than 300 million terabytes of data are created every day. The next step in unlocking the value from all that data we’re storing is being able to act on it almost instantaneously.

Real-time data analytics is the pathway to the speed and agility that you need to respond to change. Real-time data also amplifies the challenges of batched data, just continuously, at the terabyte scale.

Then, when you’re making changes or upgrades to real-time environments, “you’re changing the tires on the car, while you’re going down the road,” Ramos Mays, CTO of Semantic-AI, told The New Stack.

How do you know if your organization is ready for that deluge of data and insights? Do you have the power, infrastructure, scalability and standardization to make it happen? Are all of the stakeholders at the planning board? Do you even need real time for all use cases and all datasets?

Before you go all-in on real time, there are a lot of data best practices to evaluate and put in place before committing to that significant cost.

1. Know When to Use Real-Time Data

Just because you can collect real-time data, doesn’t mean you always need it. Your first step should be thinking about your specific needs and what sort of data you’ll require to monitor your business activity and make decisions.

Some use cases, like supply chain logistics, rely on real-time data for real-time reactions, while others simply demand a much slower information velocity and only need analysis on historical data.

Most real-time data best practices come down to understanding your use cases up front because, Mays said, “maintaining a real-time infrastructure and organization has costs that come alongside it. You only need it if you have to react in real time.

“I can have a real-time ingestion of traffic patterns every 15 seconds, but, if the system that’s reading those traffic patterns for me only reads it once a day, as a snapshot of only the latest value, then I don’t need real-time 15-second polling.”

Nor, he added, should he need to support the infrastructure to maintain it.

Most companies, like users of Semantic-AI, an enterprise intelligence platform, need a mix of historical and real-time data; Mays’ company, for instance, is selective about when it does and doesn’t opt for using information collected in real time.

He advises bringing together your stakeholders at the start of your machine learning journey and ask: Do we actually need real-time data or is near-real-time streaming enough? What’s our plan to react to that data?

Often, you just need to react if there’s a change, so you would batch most of your data, and then go for real time only for critical changes.

“With supply chain, you only need real time if you have to respond in real time,” Mays said. “I don’t need real-time weather if I’m just going to do a historic risk score, [but] if I am going to alert there’s a hurricane through the flight path of your next shipment [and] it’s going to be delayed for 48 hours, you’re reacting in real time.”

2. Keep Data as Lightweight as Possible

Next you need to determine which categories of data actually add value by being in real time in order to keep your components lightweight.

“If I’m tracking planes, I don’t want my live data tracking system to have the flight history and when the tires were last changed,” Mays said. “I want as few bits of information as possible in real time. And then I get the rest of the embellishing information by other calls into the system.”

Real-time data must be designed differently than batch, he said. Start thinking about where it will be presented in the end, and then, he recommended, tailor your data and streams to be as close to your display format as possible. This helps determine how the team will respond to changes.

For example, if you’ve got customers and orders, one customer can have multiple orders. “I want to carry just the amount of information in my real-time stream that I need to display to the users,” he said, such as the customer I.D. and order details. Even then, you will likely only show the last few orders in live storage, and then allow customers to search and pull from the archives.

For risk scoring, a ground transportation algorithm needs real-time earthquake information, while the aviation algorithm needs real-time wind speed — it’s rare that they would both need both.

Whenever possible, Mays added, only record deltas — changes in your real-time data. If your algorithm is training on stock prices, but those only change every 18 seconds, you don’t need it set for every quarter second. There’s no need to store those 72 data points across networks when you could instead send a single message only when the value changes. This in turn reduces your organizational resource requirements and focuses again on the actionable.
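
A minimal sketch of that delta-only approach, with illustrative timestamps and prices, might look like this:

def record_deltas(readings):
    """Keep a reading only when its value differs from the previous one."""
    deltas, last_value = [], object()   # sentinel so the first reading is always kept
    for timestamp, value in readings:
        if value != last_value:
            deltas.append((timestamp, value))
            last_value = value
    return deltas

ticks = [(0, 101.5), (1, 101.5), (2, 101.5), (3, 101.7)]
print(record_deltas(ticks))   # [(0, 101.5), (3, 101.7)] -- only the changes are stored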

3. Unclog Your Pipes

Your data can be stored in the RAM of your computer, on disk or in the network pipe. Reading and writing everything to the disk is the slowest. So, Mays recommended, if you’re dealing in real-time systems, stay in memory as much as you can.

“You should design your systems, if at all possible, to only need the amount of data to do its thing so that it can fit in memory,” he said, so your real-time memory isn’t held up in writing and reading to the disk.

“Computer information systems are like plumbing,” Mays said. “Still very mechanical.”

Think of the amount of data as water. The size of pipes determines how much water you can send through. One stream of water may need to split into five places. Your pipes are your network cables or, when inside the machine, the I/O bus that moves the data from RAM memory to the hard disk. The networks are the water company mainlines, while the bus inside acts like the connection between the mainlines and the different rooms.

Most of the time, this plumbing just sits there, waiting to be used. You don’t really think about it until you are filling up your bathtub (RAM). If you have a hot water heater (or a hard drive), it’ll heat up right away; if it’s coming from your water main (disk or networking cable), it takes time to heat up. Either way, when you have finished using the water (data) in your bathtub (RAM), it drains away and is gone.

You must have telemetry and monitoring, Mays said, extending the metaphor, because “we also have to do plumbing while the water is flowing a lot of times. And if you have real-time systems and real-time consumers, you have to be able to divert those streams or store them and let it back up or to divert it around a different way,” while fixing it, in order to meet your service-level agreement.

4. Look for Outliers

As senior adviser for the Office of Management, Strategy and Solutions at the U.S. Department of State, Landon Van Dyke oversees the Internet of Things network for the whole department — including, but not limited to, all sensor data, smart metering, air monitoring, and vehicle telematics across offices, embassies and consulates, and residences. Across all resources and monitors, his team exclusively deals in high-frequency, real-time data, maintaining two copies of everything.

He takes a contrary perspective to Mays and shared it with The New Stack. With all data in real time, Van Dyke’s team is able to spot crucial outliers more often, and faster.

“You can probably save a little bit of money if you look at your utility bill at the end of the month,” Van Dyke said, explaining why his team took on its all-real-time strategy to uncover better patterns at a higher frequency. “But it does not give you the fidelity of what was operating at three in the afternoon on a Wednesday, the third week of the month.”

The specificity of energy consumption patterns are necessary to really make a marked difference, he argued. Van Dyke’s team uses that fidelity to identify when things aren’t working or something can be changed or optimized, like when a diplomat is supposed to be away but the energy usage shows that someone has entered their residence without authorization.

“Real-time data provides you an opportunity for an additional security wrapper around facilities, properties and people, because you understand a little bit more about what is normal operations and what is not normal,” he said. “Not normal is usually what gets people’s attention.”

5. Find Your Baseline

“When people see real-time data, they get excited. They’re like, ‘Hey, I could do so much if I understood this was happening!’ So you end up with a lot of use cases upfront,” Van Dyke observed. “But, most of the time, people aren’t thinking on the backend. Well, what are you doing to ensure that use case is fulfilled?”

Without proper planning upfront, he said, teams are prone to just slap on sensors that produce data every few seconds, connecting them to the internet and to a server somewhere, which starts ingesting the data.

“It can get overwhelming to your system real fast,” Van Dyke said. “If you don’t have somebody manning this data 24/7, the value of having it there is diminished.”

It’s not a great use of anyone’s time to pay people to stare at a screen 24 hours a day, so you need to set up alerts. But, in order to do that, you need to identify what an outlier is.

That’s why, he said, you need to first understand your data and set up your baseline, which could take up to six months, or even longer, when data points begin to have more impact on each other, like within building automation systems. You have to manage people’s expectations of the value of machine learning early on.

Once you’ve identified your baseline, you can set the outliers and alerts, and go from there.
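
A toy Python sketch of that pattern could be as simple as a z-score check against the learned baseline. The numbers and the threshold below are illustrative assumptions, not anything from the State Department’s actual systems:

import statistics

def find_outliers(readings, baseline, z_threshold=3.0):
    """Flag readings more than z_threshold standard deviations from the baseline."""
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)
    return [r for r in readings if abs(r - mean) > z_threshold * stdev]

baseline = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 4.9]   # e.g., kWh per hour learned over months
today = [5.0, 5.1, 9.7, 4.9]                     # 9.7 is well outside normal operations
print(find_outliers(today, baseline))            # -> [9.7], which would trigger an alert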

6. Move to a Real-Time Ready Database

Still, at the beginning of this machine learning journey, Van Dyke says, most machines aren’t set up to handle that massive quantity of data. Real-time data easily overwhelms memory.

“Once you get your backend analysis going, it ends up going through a series of models,” Van Dyke said. “Most of the time, you’ll bring the data in. It needs to get cleaned up. It needs to go through a transformation. It needs to be run through an algorithm for cluster analysis or regression models. And it’s gotta do it on the fly, in real time.”

As you move from batch processing to real-time data, he continued, you quickly realize your existing system is not able to accomplish the same activities at a two- to five-second cadence. This inevitably leads to more delays, as your team has to migrate to a faster backend system that’s set up to work in real time.

This is why the department moved over to Kinetica’s real-time analytics database, which, Van Dyke said, has the speed built in to handle running these backend analyses on a series of models, ingesting and cleaning up data, and providing analytics. “Whereas a lot of the other systems out there, they’re just not built for that,” he added. “And they can be easily overwhelmed with real-time data.”

7. Use Standardization to Work with Non-Tech Colleagues

What’s needed now won’t necessarily be in demand in the next five years, Van Dyke predicted.

“For right now, where the industry is, if you really want to do some hardcore analytics, you’re still going to want people that know the coding, and they’re still going to want to have a platform where they can do coding on,” he said. “And Kinetica can do that.”

He sees a lot of graphical user interfaces cropping up and predicts ownership and understanding of analytics will soon shift to becoming a more cross-functional collaboration. For instance, the subject matter expert (SME) for building analytics may now be the facilities manager, not someone trained in how to code. For now, these knowledge gaps are closed by a lot of handholding between data scientists and SMEs.

Standardization is essential among all stakeholders. Since everything real time is done at a greater scale, you need to know what your format, indexing and keys are well in advance of going down that rabbit hole.

This standardization is no simple feat in an organization as distributed as the U.S. State Department. However, its solution can be mimicked in most organizations — finance teams are most likely to already have a cross-organizational footprint in place. State controls the master dataset, indexes and metadata for naming conventions and domestic agencies, standardizing them across the government based on the Treasury’s codes.

Van Dyke’s team ensured via logical standardization that “no other federal agency should be able to have its own unique code on U.S. embassies and consulates.”

8. Back-up in Real Time

As previously mentioned, the State Department also splits its data into two streams — one for model building and one for archival back-up. This still isn’t common practice in most real-time data-driven organizations, Van Dyke said, but it follows the control versus variable rule of the scientific method.

“You can always go back to that raw data and run the same algorithms that your real-time one is doing — for provenance,” he said. “I can recreate any outcome that my modeling has done, because I have the archived data on the side.” The State Department also uses the archived data for forensics, like finding patterns of motion around the building and then flagging deviations.

Yes, this potentially doubles the cost, but data storage is relatively inexpensive these days, he said.

The department also standardizes ways to reduce metadata repetition. For example, a team might want to capture the speed of a fan in a building, but the metadata for each reading would include the fan’s make and model and the firmware for the fan controller. Van Dyke’s team exponentially reduces such repetitive data in a table column by leveraging JSON to create nested arrays, which allows the team to decrease the amount of data by associating one note of firmware with all the speed logs.
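
A simplified sketch of that nesting is shown below; the asset, make, model and firmware values are made up for illustration. The metadata appears once and is associated with the whole array of speed logs instead of being repeated on every row.

import json

fan_record = {
    "asset": "supply-fan-3",                                   # illustrative name
    "metadata": {"make": "ExampleCorp", "model": "EC-200", "firmware": "2.4.1"},
    "speed_logs": [                                            # many readings, one metadata note
        {"t": "2023-06-01T13:00:00Z", "rpm": 1180},
        {"t": "2023-06-01T13:00:05Z", "rpm": 1185},
        {"t": "2023-06-01T13:00:10Z", "rpm": 1179},
    ],
}

print(json.dumps(fan_record, indent=2))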

It’s not just for real time, but in general, Van Dyke said: “You have to know your naming conventions, know your data, know who your stakeholders are, across silos. Make sure you have all the right people in the room from the beginning.”

Data is a socio-technical game, he noted. “The people that have produced the data are always protective of it. Mostly because they don’t want the data to be misinterpreted. They don’t want it to be misused. And sometimes they don’t want people to realize how many holes are in the data or how incomplete the data [is]. Either way, people have become very protective of their data. And you need to have them at the table at the very beginning.”

In the end, real-time data best practices rely on collaboration across stakeholders and a whole lot of planning upfront.

The post 8 Real-Time Data Best Practices appeared first on The New Stack.

]]>
MongoDB vs. PostgreSQL vs. ScyllaDB: Tractian’s Experience https://thenewstack.io/mongodb-vs-postgresql-vs-scylladb-tractians-experience/ Wed, 31 May 2023 13:10:04 +0000 https://thenewstack.io/?p=22709469

Tractian is a machine intelligence company that provides industrial monitoring systems. Last year, we faced the challenge of upgrading our

The post MongoDB vs. PostgreSQL vs. ScyllaDB: Tractian’s Experience appeared first on The New Stack.

]]>

Tractian is a machine intelligence company that provides industrial monitoring systems. Last year, we faced the challenge of upgrading our real-time machine learning (ML) environment and analytical dashboards to support an aggressive increase in our data throughput, as we managed to expand our customer base and data volume by 10 times.

We recognized that to stay ahead in the fast-paced world of real-time machine learning, we needed a data infrastructure that was flexible, scalable and highly performant. We believed that ScyllaDB would provide us with the capabilities we lacked, enabling us to push our product and algorithms to the next level.

But you probably are wondering why ScyllaDB was the best fit. We’d like to show you how we transformed our engineering process to focus on improving our product’s performance. We’ll cover why we decided to use ScyllaDB, the positive outcomes we’ve seen as a result and the obstacles we encountered during the transition.

How We Compared NoSQL Databases

When talking about databases, many options come to mind. However, we started by deciding to focus on those with the largest communities and applications. This left three direct options: two market giants and a newcomer that has been surprising competitors. We looked at four characteristics of those databases — data model, query language, sharding and replication — and used these characteristics as decision criteria for our next steps.

First off, let’s give you a deeper understanding of the three databases using the defined criteria:

MongoDB NoSQL

  • Data model: MongoDB uses a document-oriented data model where data is stored in BSON (Binary JSON) format. Documents in a collection can have different fields and structures, providing a high degree of flexibility. The document-oriented model enables basically any data modeling or relationship modeling.
  • Query language: MongoDB uses a custom query language called MongoDB Query Language (MQL), which is inspired by SQL but with some differences to match the document-oriented data model. MQL supports a variety of query operations, including filtering, grouping and aggregation.
  • Sharding: MongoDB supports sharding, which is the process of dividing a large database into smaller parts and distributing the parts across multiple servers. Sharding is performed at the collection level, allowing for fine-grained control over data placement. MongoDB uses a config server to store metadata about the cluster, including information about the shard key and shard distribution.
  • Replication: MongoDB provides automatic replication, allowing for data to be automatically synchronized between multiple servers for high availability and disaster recovery. Replication is performed using a replica set, where one server is designated as the primary member and the others as secondary members. Secondary members can take over as the primary member in case of a failure, providing automatic fail recovery.

ScyllaDB NoSQL

  • Data model: ScyllaDB uses a wide column-family data model, which is similar to Apache Cassandra. Data is organized into columns and rows, with each column having its own value. This model is designed to handle large amounts of data with high write and read performance.
  • Query language: ScyllaDB uses the Cassandra Query Language (CQL), which is similar to SQL but with some differences to match the wide column-family data model. CQL supports a variety of query operations, including filtering, grouping and aggregation.
  • Sharding: ScyllaDB uses sharding, which is the process of dividing a large database into smaller parts and distributing the parts across multiple nodes (and down to individual cores). The sharding is performed automatically, allowing for seamless scaling as the data grows. ScyllaDB uses a consistent hashing algorithm to distribute data across the nodes (and cores), ensuring an even distribution of data and load balancing.
  • Replication: ScyllaDB provides automatic replication, allowing for data to be automatically synchronized between multiple nodes for high availability and disaster recovery. Replication is performed using a replicated database cluster, where each node has a copy of the data. The replication factor can be configured, allowing for control over the number of copies of the data stored in the cluster.

PostgreSQL

  • Data model: PostgreSQL uses a relational data model, which organizes data into tables with rows and columns. The relational model provides strong support for data consistency and integrity through constraints and transactions.
  • Query language: PostgreSQL uses structured query language (SQL), which is the standard language for interacting with relational databases. SQL supports a wide range of query operations, including filtering, grouping and aggregation.
  • Sharding: PostgreSQL does not natively support sharding, but it can be achieved through extensions and third-party tools. Sharding in PostgreSQL can be performed at the database, table or even row level, allowing for fine-grained control over data placement.
  • Replication: PostgreSQL provides synchronous and asynchronous replication, allowing data to be synchronized between multiple servers for high availability and disaster recovery. Replication can be performed using a variety of methods, including streaming replication, logical replication and file-based replication.

What Were Our Conclusions of the Benchmark?

In terms of performance, ScyllaDB is optimized for high performance and low latency, using a shared-nothing architecture and multithreading to provide high throughput and low latencies.

MongoDB is optimized for ease of use and flexibility, offering a more accessible and developer-friendly experience and has a huge community to help with future issues.

PostgreSQL, on the other hand, is optimized for data integrity and consistency, with a strong emphasis on transactional consistency and ACID (atomicity, consistency, isolation, durability) compliance. It is a popular choice for applications that require strong data reliability and security. It also supports various data types and advanced features such as stored procedures, triggers and views.

When choosing between PostgreSQL, MongoDB and ScyllaDB, it is essential to consider your specific use case and requirements. If you need a powerful and reliable relational database with advanced data management features, then PostgreSQL may be the better choice. However, if you need a flexible and easy-to-use NoSQL database with a large ecosystem, then MongoDB may be the better choice.

But we were looking for something really specific: a highly scalable and high-performance NoSQL database. The answer was simple: ScyllaDB is a better fit for our use case.

MongoDB vs. ScyllaDB vs. PostgreSQL: Comparing Performance

After the research process, our team was skeptical about using just written information to make a decision that would shape the future of our product. We started digging to be sure about our decision in practical terms.

First, we built an environment to replicate our data acquisition pipeline, but we did it aggressively. We created a script to simulate a data flow bigger than the current one. At the time, our throughput was around 16,000 operations per second, and we tested the database with 160,000 operations per second (so basically 10x).

To be sure, we also tested the write and read response times for different formats and data structures; some were similar to the ones we were already using at the time.

You can see our results below, comparing the new optimal configuration using ScyllaDB against our old MongoDB setup, after applying the tests mentioned above:

MongoDB vs. ScyllaDB P90 Latency (Lower Is Better)

MongoDB vs. ScyllaDB Request Rate/Throughput (Higher Is Better)

The results were overwhelming. With similar infrastructure costs, we achieved much better latency and capacity; the decision was clear and validated. We had a massive database migration ahead of us.

Migrating from MongoDB to ScyllaDB NoSQL

As soon as we decided to start the implementation, we faced real-world difficulties. Some things are important to mention.

In this migration, we added new information and formats, which affected all production services that consume this data directly or indirectly. They would have to be refactored by adding adapters in the pipeline or recreating part of the processing and manipulation logic.

During the migration journey, both services and databases had to be duplicated, since we could not use an outage window to swap between the old and new versions while validating our pipeline. That is one of the realities of critical real-time systems: An outage is never permitted, even when you are fixing or updating the system.

The reconstruction process also had to extend to the data science models, so they could take advantage of the new format and improve both accuracy and computational performance.

Given these guidelines, we created two groups. One was responsible for administering and maintaining the old database and architecture. The other group performed a massive reprocessing of our data lake and refactored the models and services to handle the new architecture.

The complete process, from designing the structure to the final deployment and swap of the production environment, took six months. During this period, adjustments and significant corrections were necessary. You never know what lessons you’ll learn along the way.

NoSQL Migration Challenges

ScyllaDB can achieve this kind of performance because it is designed to take advantage of high-end hardware and very specific data modeling. The final results were astonishing, but it took some time to achieve them. Hardware has a significant impact on performance: ScyllaDB is optimized for modern multicore processors, uses all available CPU cores to process data and takes advantage of hardware acceleration technologies such as AVX2 (Advanced Vector Extensions 2) and AES-NI (Advanced Encryption Standard New Instructions). Performance also depends on the type and speed of the storage devices, such as solid-state disks and NVMe (nonvolatile memory express) drives.

In our early testing, we messed up some hardware configurations, leading to performance degradation. When those problems were fixed, we stumbled upon another problem: data modeling.

ScyllaDB uses the Cassandra data model, which heavily dictates the performance of your queries. If you make incorrect assumptions about the data structures, queries or the data volume, as we did at the beginning, the performance will suffer.

In practice, the first proposed data format ended up exceeding the maximum size recommended for a ScyllaDB partition in some cases, which made the database perform poorly.

Our main difficulty was understanding how to translate our old data modeling to one that would perform on ScyllaDB. We had to restructure the data into multiple tables and partitions, sometimes duplicating data to achieve better performance.
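
To make the partition-size problem concrete, here is a minimal sketch of the kind of remodeling involved. The keyspace, tables and columns are hypothetical, not our production schema; the point is simply that adding a time bucket to the partition key keeps any single partition from growing without bound, at the cost of the application needing to know which bucket to read, and sometimes duplicating data across tables for other query shapes.

    # Hypothetical before/after data model illustrating bounded partitions,
    # using the open source cassandra-driver (compatible with ScyllaDB).
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("telemetry")  # assumed keyspace

    # Before: one partition per sensor. A busy sensor eventually exceeds the
    # recommended partition size and queries against it degrade.
    session.execute("""
        CREATE TABLE IF NOT EXISTS readings_by_sensor (
            sensor_id uuid,
            ts        timestamp,
            value     double,
            PRIMARY KEY (sensor_id, ts)
        )""")

    # After: the partition key also includes a day bucket, so each partition
    # holds at most one day of data per sensor. Reads must now specify the day.
    session.execute("""
        CREATE TABLE IF NOT EXISTS readings_by_sensor_day (
            sensor_id uuid,
            day       date,
            ts        timestamp,
            value     double,
            PRIMARY KEY ((sensor_id, day), ts)
        )""")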

Lessons Learned: Comparing and Migrating NoSQL Databases

In short, we learned three lessons during this process: Some came from our successes and others from our mistakes.

When researching and benchmarking the databases, it became clear that many of the specifications and functionalities present in the different databases have specific applications. Your specific use case will dictate the best database for your application. And that truth is only discovered by carrying out practical tests and simulations of the production environment in stressful situations. We invested a lot of time, and our choice to use the most appropriate database paid off.

When starting a large project, it is crucial to be prepared for a change of route in the middle of the journey. If you developed a project that did not change after its conception, you probably didn’t learn anything during the construction process, or you didn’t care about the unexpected twists. Planning cannot completely predict all real-world problems, so be ready to adjust your decisions and beliefs along the way.

You shouldn’t be afraid of big changes. Many people were against the changes we were proposing due to the risk it brought and the inconvenience it caused to developers (by changing a tool already owned by the team to a new tool that was completely unknown to the team).

Ultimately, the decision was driven by its impact on our product improvements — not by its impact on our engineering team, even though it was one of the most significant engineering changes we have made to date.

It doesn’t matter what architecture or system you are using. The real concern is whether it will be able to take your product into a bright future.

This is, in a nutshell, our journey in building one of the bridges for the future of Tractian’s product. If you have any questions or comments, feel free to contact us.

The post MongoDB vs. PostgreSQL vs. ScyllaDB: Tractian’s Experience appeared first on The New Stack.

Oracle Support for MySQL 5.7 Ends Soon, Key Upgrades in 8.0 https://thenewstack.io/oracle-support-for-mysql-5-7-ends-soon-key-upgrades-in-8-0/ Tue, 30 May 2023 17:35:59 +0000 https://thenewstack.io/?p=22708837

Oracle will end support for MySQL 5.7, an open source relational database it’s developed, on Oct. 31, 2023. The 5.7 version, for which Oracle still provides Extended Support, was released in 2015.

The upcoming end-of-life date means Oracle will no longer provide updates to this version of the database, which is widely used by some of Silicon Valley’s biggest names for web-based applications. Organizations may still continue to use this solution for as long as they want.

The scant remaining time until Oracle’s support expires has renewed interest in MySQL 8.0, which was initially released five years ago. Oracle currently offers Premier Support and Extended Support for this edition, which will expire in 2025 and 2026, respectively.

The 8.0 offering features a plethora of upgrades that include “a whole slew of things in there,” summarized Dave Stokes, Percona Technology Evangelist. “I would say it’s like jumping from a 12-inch black and white TV to a 70-inch 4K TV. It’s a big step forward.”

Some of the more noteworthy advancements in the 8.0 version include a formal data dictionary and greater JSON functionality. There are also mechanisms for helping organizations understand — and act on — requirements for upgrades for their particular version of the database.

Upgrade Utility

MySQL Shell contains functionality for helping users transition to MySQL 8.0. According to Stokes, “If you run the MySQL Shell, there’s a utility called Check For Server Upgrade that will go out there and run 21 or 23 different checks.” Those checks evaluate a host of factors relevant to upgrading to MySQL 8.0, including everything from partitioning issues to outdated character sets.

In addition to examining which facets of the current version need to be updated, this feature also provides timely advice for rectifying issues to make the process smoother. “It doesn’t fix it for you; you have to do it,” Stokes commented. “It says, ‘here’s how you solve this issue, or here’s where the manual page is to help you solve this issue’.”
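
As a rough sketch of what invoking that checker looks like, the single call below runs it from MySQL Shell's Python mode. The connection URI is a placeholder, and the exact options available vary by MySQL Shell version, so treat this as an approximation and confirm against the documentation for your release.

    # Run inside MySQL Shell (mysqlsh) after switching to Python mode with \py.
    # The connection URI is a placeholder; the utility connects to the server,
    # runs its battery of checks and prints a report of anything that needs
    # attention before upgrading to MySQL 8.0.
    util.check_for_server_upgrade("appuser@db-primary:3306")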

Metadata Dictionary

One of the immediate benefits of adopting MySQL 8.0 is its data dictionary, which houses metadata about the offering in a single, centralized locus. In previous versions of the database, such metadata was scattered about in separate files in numerous places. “If you have a junior administrator that saw all these little bitty files and decided to clean them up, then suddenly the database [would] stop working,” Stokes remarked. Centralizing this metadata makes operations more efficient and enhances system performance.

For starters, it makes it possible to glean, in a single query, specifics about the structure of tables and their schema. Without this capability, to obtain this same information “with 5.7 you’re actually opening up a table and looking at the definition of a table,” Stokes said. “You’re opening the file, reading it, then taking a look at it and figuring, if I want to use this information how do I parse this in somehow and handle it?” The query approach of 8.0 automates and accelerates such manual processing, as well as others involving regular expressions.
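
As an illustration of the single-query point, the sketch below pulls table structure out of INFORMATION_SCHEMA, which MySQL 8.0 serves from the transactional data dictionary rather than from per-table files on disk. The host, credentials and schema name are placeholders.

    # Query table metadata from the MySQL 8.0 data dictionary via INFORMATION_SCHEMA.
    # Host, credentials and schema name are placeholders.
    import mysql.connector

    cnx = mysql.connector.connect(
        host="db-primary", user="appuser", password="secret", database="appdb")
    cur = cnx.cursor()
    cur.execute("""
        SELECT table_name, column_name, data_type, is_nullable
        FROM information_schema.columns
        WHERE table_schema = %s
        ORDER BY table_name, ordinal_position
    """, ("appdb",))
    for table, column, dtype, nullable in cur:
        print(f"{table}.{column}: {dtype} (nullable: {nullable})")
    cur.close()
    cnx.close()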

JSON

The JSON enhancements MySQL 8.0 offers are particularly useful, expanding the value of what Stokes termed “the data interchange format of choice.” On the one hand, there is a JSON table function that enables users to temporarily position the unstructured and semi-structured data in JSON objects into a table. That way, organizations can readily query that data via SQL techniques like “windowing function, aggregate function, sort by, group by… all that sort of stuff,” Stokes noted. When doing so, the actual JSON data will remain unstructured (or semi-structured), while there is a structured data copy of it.

Alternatively, the latest version of MySQL has ways to “take that unstructured data and materialize it into its own column and make it permanently structured,” Stokes revealed. “So, if you have stuff and you want to make it permanently structured, so you can index it and make it faster for search, you can.” These capabilities are crucial for dealing with the massive quantities of unstructured data organizations regularly encounter. According to Stokes, it’s also pertinent for working with all of the data formatted in JSON, which spans everything from tiny Internet of Things devices to some of the world’s biggest databases.
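
The two JSON capabilities Stokes describes look roughly like the following. The table and column names are made up for illustration: the first statement flattens a JSON document into rows with JSON_TABLE so it can be queried with ordinary SQL, and the second materializes one JSON field as a stored generated column so it can be indexed for faster search.

    # MySQL 8.0 JSON features: JSON_TABLE for ad hoc relational views of JSON,
    # and a stored generated column to make one JSON field permanently structured.
    # The `events` table and its `doc` JSON column are hypothetical.
    import mysql.connector

    cnx = mysql.connector.connect(
        host="db-primary", user="appuser", password="secret", database="appdb")
    cur = cnx.cursor()

    # Flatten a JSON array into rows, then treat it like any other table.
    cur.execute("""
        SELECT jt.name, jt.qty
        FROM JSON_TABLE(
            '[{"name": "widget", "qty": 3}, {"name": "gadget", "qty": 5}]',
            '$[*]' COLUMNS (
                name VARCHAR(32) PATH '$.name',
                qty  INT PATH '$.qty'
            )
        ) AS jt
    """)
    print(cur.fetchall())

    # Materialize doc->>'$.device_id' as its own indexed, stored column.
    cur.execute("""
        ALTER TABLE events
          ADD COLUMN device_id VARCHAR(64)
              GENERATED ALWAYS AS (doc->>'$.device_id') STORED,
          ADD INDEX idx_events_device_id (device_id)
    """)
    cnx.commit()
    cur.close()
    cnx.close()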

A Logical Progression

Upgrading from MySQL 5.7 to MySQL 8.0 is a logical progression. There are numerous enhancements in the latter version, like its data dictionary and JSON improvements, that make doing so worthwhile. Moreover, there are dedicated mechanisms to ease the upgrade process and pinpoint exactly what organizations need to do to make this transition work.

The post Oracle Support for MySQL 5.7 Ends Soon, Key Upgrades in 8.0 appeared first on The New Stack.

Raft Native: The Foundation for Streaming Data’s Best Future https://thenewstack.io/raft-native-the-foundation-for-streaming-datas-best-future/ Tue, 30 May 2023 16:44:09 +0000 https://thenewstack.io/?p=22709492

Consensus is fundamental to consistent, distributed systems. To guarantee system availability in the event of inevitable crashes, systems need a way to ensure each node in the cluster is in alignment, such that work can seamlessly transition between nodes in the case of failures.

Consensus protocols such as Paxos, Raft and Viewstamped Replication (VSR) help drive resiliency for distributed systems by providing the logic for processes like leader election, atomic configuration changes, synchronization and more.

As with all design elements, the different approaches to distributed consensus offer different tradeoffs. Paxos is the oldest consensus protocol around and is used in many systems like Google Spanner, Apache Cassandra, Amazon DynamoDB and Neo4j.

Paxos achieves consensus in a three-phased, leaderless, majority-wins protocol. While Paxos is effective in driving correctness, it is notoriously difficult to understand, implement and reason about. This is partly because it obscures many of the challenges in reaching consensus (such as leader election, and reconfiguration), making it difficult to decompose into subproblems.

Raft (for reliable, replicated, redundant and fault-tolerant) can be thought of as an evolution of Paxos — focused on understandability. This is because Raft can achieve the same correctness as Paxos but is more understandable and simpler to implement in the real world, so often it can provide greater reliability guarantees.

For example, Raft uses a stable form of leadership, which simplifies replication log management. And its leader election process, driven through an elegant “heartbeat” system, is more compatible with the Kafka-producer model of pushing data to the partition leader, making it a natural fit for streaming data systems like Redpanda. More on this later.

Because Raft decomposes the different logical components of the consensus problem, for example by making leader election a distinct step before replication, it is a flexible protocol to adapt for complex, modern distributed systems that need to maintain correctness and performance while scaling to petabytes of throughput, all while remaining simple enough for new engineers hacking on the codebase to understand.

For these reasons, Raft has been rapidly adopted for today’s distributed and cloud native systems like MongoDB, CockroachDB, TiDB and Redpanda to achieve greater performance and transactional efficiency.

How Redpanda Implements Raft Natively to Accelerate Streaming Data

When Redpanda founder Alex Gallego determined that the world needed a new streaming data platform — to support the kind of gigabytes-per-second workloads that bring Apache Kafka to a crawl without major hardware investments — he decided to rewrite Kafka from the ground up.

The requirements for what would become Redpanda were: 1) it needed to be simple and lightweight to reduce the complexity and inefficiency of running Kafka clusters reliably at scale; 2) it needed to maximize the performance of modern hardware to provide low latency for large workloads; and 3) it needed to guarantee data safety even for very large throughputs.

The initial design for Redpanda used chain replication: Data is produced to node A, then replicated from A to B, B to C and so on. This was helpful in supporting throughput, but fell short for latency and performance, due to the inefficiencies of chain reconfiguration in the event of node downtime (say B crashes: Do you fail the write? Does A try to write to C?). It was also unnecessarily complex, as it would require an additional process to supervise the nodes and push reconfigurations to a quorum system.

Ultimately, Alex decided on Raft as the foundation for Redpanda consensus and replication, due to its understandability and strong leadership. Raft satisfied all of Redpanda’s high-level design requirements:

  • Simplicity. Every Redpanda partition is a Raft group, so everything in the platform is reasoning around Raft, including both metadata management and partition replication. This contrasts with the complexity of Kafka, where data replication is handled by ISR (in-sync replicas) and metadata management is handled by ZooKeeper (or KRaft), and you have two systems that must reason with one another.
  • Performance. The Redpanda Raft implementation can tolerate disturbances to a minority of replicas, so long as the leader and a majority of replicas are stable. In cases when a minority of replicas have a delayed response, the leader does not have to wait for their responses to progress, mitigating impact on latency. Redpanda is therefore more fault-tolerant and can deliver predictable performance at scale.
  • Reliability. When Redpanda ingests events, they are written to a topic partition and appended to a log file on disk. Every topic partition then forms a Raft consensus group, consisting of a leader plus a number of followers, as specified by the topic’s replication factor. A Redpanda Raft group can tolerate ƒ failures given 2ƒ+1 nodes; for example, in a cluster with five nodes and a topic with a replication factor of five, two nodes can fail and the topic will remain operational. Redpanda leverages the Raft joint consensus protocol to provide consistency even during reconfiguration.
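
To make the 2ƒ+1 arithmetic in the reliability point above concrete, here is a small sketch. This is generic Raft quorum math, not Redpanda-specific code.

    # Generic Raft quorum arithmetic: with n replicas, a majority quorum is
    # floor(n/2) + 1, so the group tolerates f = floor((n - 1) / 2) failures.
    def raft_tolerance(replicas: int) -> tuple[int, int]:
        quorum = replicas // 2 + 1
        tolerated_failures = (replicas - 1) // 2
        return quorum, tolerated_failures

    for n in (3, 5, 7):
        q, f = raft_tolerance(n)
        print(f"{n} replicas: quorum of {q}, tolerates {f} failures")
    # 5 replicas: quorum of 3, tolerates 2 failures — matching the example above.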

Redpanda also extends core Raft functionality in some critical ways to achieve the scalability, reliability and speed required of a modern, cloud native solution. Redpanda enhancements to Raft tend to focus on Day 2 operations, for instance how to ensure the system runs reliably at scale. These innovations include changes to the election process, heartbeat generation and, critically, support for Apache Kafka acks.

Redpanda’s optimistic implementation of Raft is what enables it to be significantly faster than Kafka while still guaranteeing data safety. In fact, Jepsen testing has verified that Redpanda is a safe system without known consistency problems and a solid Raft-based consensus layer.

But What about KRaft?

While Redpanda takes a Raft-native approach, the legacy streaming data platforms have been laggards in adopting modern approaches to consensus. Kafka itself is a replicated distributed log, but it has historically relied on yet another replicated distributed log — Apache ZooKeeper — for metadata management and controller election.

This has been problematic for a few reasons: 1) Managing multiple systems introduces administrative burden; 2) Scalability is limited due to inefficient metadata handling and double caching; 3) Clusters can become very bloated and resource intensive — in fact, it is not too uncommon to see clusters with equal numbers of ZooKeeper and Kafka nodes.

These limitations have not gone unacknowledged by Apache Kafka’s committers and maintainers, who are in the process of replacing ZooKeeper with a self-managed metadata quorum: Kafka Raft (KRaft).

This event-based flavor of Raft achieves metadata consensus via an event log, called a metadata topic, that improves recovery time and stability. KRaft is a positive development for the upstream Apache Kafka project because it helps alleviate pains around partition scalability and generally reduces the administrative challenges of Kafka metadata management.

Unfortunately, KRaft does not solve the problem of having two different systems for consensus in a Kafka cluster. In the new KRaft paradigm, KRaft partitions handle metadata and cluster management, but replication is handled by the brokers using ISR, so you still have these two distinct platforms and the inefficiencies that arise from that inherent complexity.

The engineers behind KRaft are upfront about these limitations, although some exaggerated vendor pronouncements have created ambiguity around the issue, suggesting that KRaft is far more transformative.

Combining Raft with Performance Engineering: A New Standard for Streaming Data

As data industry leaders like CockroachDB, MongoDB, Neo4j and TiDB have demonstrated, Raft-based systems deliver simpler, faster and more reliable distributed data environments. Raft is becoming the standard consensus protocol for today’s distributed data systems because it marries particularly well with performance engineering to further boost the throughput of data processing.

For example, Redpanda combines Raft with speedy architectural ingredients to perform at least 10 times faster than Kafka at tail latencies (p99.99) when processing a 1GBps workload, on one-third the hardware, without compromising data safety.

Traditionally, GBps+ workloads have been a burden for Apache Kafka, but Redpanda can support them with double-digit millisecond latencies, while retaining Jepsen-verified reliability. How is this achieved? Redpanda is written in C++, and uses a thread-per-core architecture to squeeze the most performance out of modern chips and network cards. These elements work together to elevate the value of Raft for a distributed streaming data platform.

Redpanda vs. Kafka with KRaft performance benchmark – May 11, 2023

An example of this in terms of Redpanda internals: Because Redpanda bypasses the page cache and the Java virtual machine (JVM) dependency of Kafka, it can embed hardware-level knowledge into its Raft implementation.

Typically, every time you write in Raft you have to flush to guarantee the durability of writes on disk. In Redpanda’s approach to Raft, smaller intermittent flushes are dropped in favor of a larger flush at the end of a call. While this introduces some additional latency per call, it reduces overall system latency and increases overall throughput, because it is reducing the total number of flush operations.
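
As a rough, conceptual illustration of that trade-off, the sketch below shows the general "batch the flushes" idea in Python. It is not Redpanda's C++ implementation, just the group-commit pattern the paragraph describes, with a hypothetical log path.

    # Conceptual sketch of batching durability flushes: instead of fsync-ing
    # after every record, append a whole batch and fsync once at the end.
    import os

    def append_batch(path: str, records: list[bytes]) -> None:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        try:
            for record in records:
                os.write(fd, record)   # buffered by the OS, not yet durable
            os.fsync(fd)               # one flush makes the whole batch durable
        finally:
            os.close(fd)

    # Each call pays one fsync for many records, trading a little per-call
    # latency for fewer flush operations and higher overall throughput.
    append_batch("/tmp/raft-log.bin", [b"entry-1\n", b"entry-2\n", b"entry-3\n"])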

While there are many effective ways to ensure consistency and safety in distributed systems (blockchains do it with proof-of-work and proof-of-stake protocols, for example), Raft is a proven approach and flexible enough to be enhanced to adapt to new challenges.

As we enter a new world of data-driven possibilities, driven in part by AI and machine learning use cases, the future is in the hands of developers who can harness real-time data streams. Raft-based systems, combined with performance-engineered elements like C++ and thread-per-core architecture, are driving the future of data streaming for mission-critical applications.

The post Raft Native: The Foundation for Streaming Data’s Best Future appeared first on The New Stack.

Building AI-Driven Applications with a Multimodal Approach  https://thenewstack.io/building-ai-driven-applications-with-a-multimodal-approach/ Tue, 30 May 2023 15:44:39 +0000 https://thenewstack.io/?p=22709281

Generative AI is the latest era of AI/machine learning (ML) that is unlocking new opportunities and tackling previously unaddressed challenges. It can create new content and ideas, including conversations, stories, images, videos and music. It is powered by large models that are pretrained on vast amounts of data and commonly referred to as foundation models (FMs).

The use of AI to maximize an organization’s data strategy can help boost employee productivity and collaboration and enable smarter decision-making, resulting in real, quantifiable business value. It lets developers reimagine their applications and create new customer experiences while driving faster time to revenue, transforming their businesses.

The Evolution of Generative AI

In spite of the excitement around AI, we are still in the early stages of its transformation. The current wave of generative AI is focused on creating new content from a snapshot of data that already exists. This has tremendous value for improving and automating business processes.

The real value of generative AI, however, lies in the next phase, where it can drive insights and decision-making. These early generative AI applications are all content-oriented, focused on processing vast amounts of data to produce fairly accurate results. That level of accuracy, though, is not enough to drive decisions.

The first generation of these applications work really well for the business-to-customer market where sifting through large quantities of data quickly to produce a condensed or summarized view has a huge benefit.

Often, the generated content serves as a starting point for more focused human involvement in completing the job to be done. It accelerates the low-value-add work where people spend a lot of time, whether that is building an outline, writing code, building data sets for testing or performing an analysis on data. It boosts human productivity on tasks that are not mission critical and that can be undone or revisited with relatively little effort. Copilot X is a great example of a feature where the timeliness of the output matters more than the accuracy of the content.

Quantity Versus Quality

However, the real value of AI comes from the ability to draw meaningful insights and drive outcome-based decision-making. These are the cases where quality and accuracy matter more than anything.

For AI to trigger the transformation that it can deliver, it needs to move from low-impact, low-risk content generation for low-fidelity or accuracy use cases to high-impact, high-risk analysis that drives decision-making and needs high fidelity and accuracy.

For enterprises to evaluate the return on investment (ROI) of these new AI features and determine how the features can uniquely differentiate them to customers, business-to-business applications need to drive outcomes that justify that investment. They need to do this in the context of their existing data ecosystem and in a secure manner.

Democratizing AI 

The rapid advancement and adoption of AI has democratized access to the core technology at the heart of this revolution: foundation models. The tremendous pace of innovation around large language models (LLMs) in the open source community has produced several open source LLMs, putting the core technology for building an AI business within broad reach.

Smaller models can also run on lower-powered hardware, including an iPhone, reducing the barrier to tinkering and creativity. The availability of these smaller but sufficiently high-quality models has fueled innovation and unlocked possibilities for individuals and institutions around the world. In the end, the best model is the one that can learn constantly and quickly, and hence can be iterated on and fine-tuned in a short period of time.

This has reduced the barrier to entry for training and experimentation from a few very large enterprises or the output of a major research organization to one person, an evening and a beefy laptop or even a personal computing device like an iPhone. These advances will in turn lead enterprises to evaluate the impact of AI in the context of their own businesses.

Multimodal Platforms Hold the Key to Driving AI-Powered Apps 

To be successful, enterprises need access to multiple foundation models that include multiple modalities of text, images and videos that can be fine-tuned on their proprietary data to deliver unique and business-relevant insights. They want to make it easy to take a base FM and build differentiated apps using their own data. Data is more important than ever and is the cornerstone of a successful AI initiative. This includes a reliable and performant data platform that supports unstructured operational and nonoperational data with deeper integration with the analytics and ML platforms to enable prescriptive analytics.

To achieve this in real time, there needs to be a deeper integration of the most recent up-to-the-minute data stored in an operational database. Databases need to be conversant with processing multiple modalities of data without introducing performance and latency overhead.

By building on top of a multimodal and low-latency data platform, enterprises can realize the vision of prescriptive analytics and drive AI-powered applications.

The post Building AI-Driven Applications with a Multimodal Approach  appeared first on The New Stack.

Alteryx Announces AiDIN for AI-Powered Features https://thenewstack.io/alteryx-announces-aidin-for-ai-powered-features/ Fri, 26 May 2023 15:06:16 +0000 https://thenewstack.io/?p=22709052

At its Inspire conference in Las Vegas on Wednesday, long-time data integration and AI player Alteryx announced a series of new platform capabilities, many of which were focused on generative artificial intelligence, and several more of which impressively beef up its hybrid cloud capabilities. While a number of data and analytics companies have recently added large language model (LLM)-based capabilities to their platforms, the majority of these have been focused on data exploration and querying. Alteryx’s announcements do provide coverage there, but they also address the areas of data pipeline governance, and the functionality within data pipelines themselves. These are perhaps less flashy capabilities, but are arguably just as substantive, if not more so.

Also read: Alteryx Integrates with UiPath, Uniting RPA and Data Pipelines

Suresh Vittal, Alteryx’s Chief Product Officer, briefed The New Stack on six major announcements, three of which are in the generative AI realm, and three of which address cloud capabilities of the core Alteryx platform. These follow quickly on Alteryx’s 2022 acquisition of Trifacta and the momentum that it has driven. As Vittal said, Alteryx has been “hard at work integrating the Trifacta platform. We announced the Alteryx Analytics cloud with machine learning, Designer Cloud and Auto Insights built on a common unified platform. That’s getting great uptake and our cloud has been up and running for several months now.”

Check out: Alteryx Analytics Cloud Consolidates Acquisitions and Functionality

AI, above the Din

On the generative AI side, Alteryx has announced AiDIN, which serves as the umbrella brand and engine for each of Alteryx’s AI capabilities, both old and new. According to Vittal, AiDIN is the “core framework of bringing… Alteryx’s data and Alteryx’s models, combining those capabilities and powering specific use cases.” The AiDIN-related announcements include:

  • A Workflow Summary Tool, which can create natural language summaries of any single Alteryx workflow or group of workflows. Essentially, the tool can document what workflows do after they’ve been authored, which helps engineers understand assets they need to become acquainted with, or proactively document their own work. The summaries can be embedded in a workflow’s Meta Info field, which ensures the generated summary stays attached to the workflow regardless of who accesses it and when.

The Workflow Summary Tool in action
Credit: Alteryx

  • A feature called Magic Documents, which is essentially a new adjunct to Alteryx’s already existing Auto Insights feature, is itself driven by AI. Now, in addition to creating Auto Insights, Alteryx customers can leverage generative AI to create a conversational email message, PowerPoint slide presentation or other document that summarizes the generated insights. So not only can AI generate a report, but it can now generate a cover letter of sorts, to accompany the report. This is a useful tool to summarize such reports for managers or executives who may not have time to review them in full, but still need to know what’s in them.

Magic Documents email generation.
Credit: Alteryx

  • An OpenAI Connector, embeddable in Alteryx workflows, which can call APIs in OpenAI’s generative AI platform as an automated step in a data pipeline. This takes generative AI beyond interactive chatbot scenarios and into triggered, data-driven actions that are executed autonomously. Vittal explained that this connector is for OpenAI’s own platform, and that connectors for Azure OpenAI and for customers’ own models will be forthcoming. Google AI service connectors may be added to the mix as well.
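
To give a sense of what such a connector automates under the hood, here is a minimal sketch of a direct OpenAI API call in Python. This is not Alteryx's connector or its configuration, just the kind of request a pipeline step would issue; the model name, prompt and record are placeholders, and it uses the pre-1.0 openai Python package that was current when these announcements were made.

    # Minimal generative AI step in a data pipeline, using the pre-1.0 `openai`
    # Python package (openai.ChatCompletion was the interface at the time).
    import os
    import openai

    openai.api_key = os.environ["OPENAI_API_KEY"]

    def summarize(record_text: str) -> str:
        # Ask the model to summarize one pipeline record; prompt is a placeholder.
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": "Summarize the record in one sentence."},
                {"role": "user", "content": record_text},
            ],
        )
        return response["choices"][0]["message"]["content"]

    print(summarize("order_id=123, status=delayed, reason=weather, region=midwest"))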

Use Cases, Today and Tomorrow

Alteryx’s applications of large language models and generative AI are more infrastructural than many which have surfaced in the analytics space recently.  They aren’t (yet) about using natural language to generate assets like reports or workflows, but rather using the technology to extend the reach, management, power and capabilities of those assets.

These generative AI-based capabilities extend primarily to natural language use cases, but LLMs can be used for scenarios beyond natural language, especially when combined with a customer’s own data. Vittal expects that Alteryx may move to certain of these scenarios soon, stating that “…we’re trying to …decouple the foundational model and the work that happens with the foundational model from the contextualized training and find… I probably think about it as fine-tuning… that we can do [this] using the customer’s data and Alteryx’s data.”

Vittal also told The New Stack that the company has been “working with our design partners on things like metadata enrichment, things like orchestration of specific very complex operational processes. And we’re finding that there’s real value in applying generative AI to these kinds of use cases because it takes a lot of tasks out of the process.”

AI Isn’t Everything

This is all neat stuff, and there’s no disputing how cool and transformational AI is. But, as I mentioned earlier, Alteryx made a few important non-AI related announcements too. They include:

  • So-called “cloud-connected experiences,” like Cloud Execution for Desktop, a hybrid cloud feature that allows customers to author Alteryx workflows in the Alteryx Designer desktop application, then save them to, and have them execute in, the cloud.
  • New Enterprise Utilities for enhanced governance, including Alteryx product telemetry data to manage usage across clouds, and the ability to treat workflows as code, then curate them and manage their deployment by pushing them to Git-compatible version control repositories
  • New Location Intelligence capabilities which have been rewritten for the cloud, to take advantage of the extended resources and elastic computing power provided there. This makes new use cases possible because more powerful spatial data workloads can be accommodated in the cloud. Alteryx is announcing pushdown query integration with Snowflake, and integrations with TomTom, to attain this improved performance and enable the new use cases.

Must read: Snowflake Builds out Its Data Cloud

AI Economics

That’s a lot of developments to absorb. Despite the recessionary air and austerity on both the customer and vendor sides of the data realm today, Alteryx seems to be doing quite well. Vittal put it this way: “Last quarter… we expanded 121% which is kind of best-in-class expansion. The largest of our accounts expanded 131%. So we’re continuing to see demand and durability of use cases even in this macro, some might argue in this macro even more so, because more and more the teams have to do more with less. And so automation and analytics orchestration becomes more important.”

One might argue that generative AI is creating its own economic bump for our industry. While some of that may be merely a hype-driven bubble, applications of AI like the ones Alteryx has implemented add real practicality and productivity. The latter typically drives healthy economic expansion, which is something we should all be rooting for. If we can move past the “spectacle” of AI, and focus on its down-to-earth utility, good things can result.

The post Alteryx Announces AiDIN for AI-Powered Features appeared first on The New Stack.
