A Visual Metaphor for Databases 2.0

The Rise of Databases 2.0 With GenAI

Madhukar Kumar

--

In 1970, an unassuming paper titled “A Relational Model of Data for Large Shared Data Banks” was published by E.F. Codd of IBM. It introduced the notion of relational data. A few years later, Larry Ellison, inspired by the paper, built the first commercial relational database.

And thus, Oracle, the behemoth company we all know of, was born.

As we mark roughly the 50th anniversary of the relational database, we are merely in year two of generative AI, and one thing many people are pondering is how it is going to change databases.

As someone now working at my third database company (Oracle, Redis, and now SingleStore), I had this question on my mind when I recently stumbled upon the Codd paper.

What I read in the first few lines of the paper shocked me with its eerie prescience.

“Future users of large data banks must be protected from having to know how the data is organized in the machine (the internal representation). A prompting service which supplies such information is not a satisfactory solution. Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed.”

Screenshot of Codd paper on Relational Data

This, in short, is the definition and the roadmap of Database 2.0.

Now imagine a world where you do not need to know anything about databases to use them. A world where you can simply ask a question in natural language and get an answer in the format of your choice. A world where databases are not just passive containers of data, but active agents of knowledge that can reason, analyze, and contextualize information. In real time.

A knowledge box of sorts, where the complexities of the bits and bytes are invisible, and you get what you want with the ease of talking to an assistant.

However, to understand how to get to Database 2.0, and whether we are already there, we need to understand what Database 1.0 looks like and how it evolved from a rows-and-columns relational data structure into today's tangled web of complexity.

Evolution of Databases — Speed

Google co-founder Larry Page was once quoted as saying: “Speed is a feature. Speed can drive usage as much as having bells and whistles on your product. People really underappreciate it.”

Databases have come a long way since the 1970s. Relational databases revolutionized the way data is stored, organized, and queried digitally, using a structured, consistent schema and a powerful query language: Structured Query Language (SQL), the lingua franca of data. The early consumers were simple applications that grew more complex over time, like ERP, CRM, and eventually e-commerce web applications.

These applications needed access to structured data that could be written, read, and updated at speed to keep up with users' rising expectations. This led to the birth of in-memory caches and databases. The three best-known in-memory data stores of this new breed were Redis (an open-source key-value store), Memcached, and MemSQL (now SingleStore), which specialized in running an in-memory SQL database.

Metaphor for speed generated by AI

As application development evolved with the advent of the cloud and new, more developer-friendly frameworks, databases had to evolve to support new kinds of data types.

In the 2000s, a new generation of databases emerged, known as NoSQL databases, which stands for Not Only SQL. These databases can handle various kinds of data and offer different trade-offs between consistency, availability, and partition tolerance. They can be classified into four main types: key-value stores, document stores, column-family stores, and graph databases. Each type has its own advantages and disadvantages, depending on the use case and application requirements. For example, key-value stores are highly scalable and performant, but do not support complex queries or relationships. Document stores are flexible and expressive, but lack schema enforcement and data integrity. Column-family stores are efficient and compact, but hard to query and modify. Graph databases are powerful and intuitive, but expensive and complex.
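To make those trade-offs concrete, here is a minimal sketch in plain Python (not any particular database's API) of the same customer record in a key-value model versus a document model; the field names are invented for the example.

```python
import json

# Key-value model: the value is an opaque blob; fast GET/SET by key,
# but the store itself cannot query inside the value.
kv_store = {
    "customer:42": json.dumps({"name": "Ada", "city": "London", "orders": [1001, 1002]}),
}

# Document model: the same record stored as a structured document,
# so fields like "city" can be indexed and queried directly.
doc_store = [
    {"_id": 42, "name": "Ada", "city": "London", "orders": [1001, 1002]},
]

# Key-value access: one lookup, then the application parses the blob.
record = json.loads(kv_store["customer:42"])
print(record["city"])  # -> London

# Document access: the store can filter on fields without the app unpacking blobs.
londoners = [d for d in doc_store if d["city"] == "London"]
print(len(londoners))  # -> 1
```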

However, NoSQL databases were not designed for transactional processing or analytical processing. They were mainly used for operational processing, where the focus is on storing and retrieving data quickly and reliably without worrying too much about the schema or the shape of the data. To support transactional processing, where the focus is on ensuring data consistency and integrity across multiple operations, relational databases were still the preferred choice. These databases are optimized for OLTP (Online Transaction Processing), which involves short and frequent transactions that modify a small number of records. For example, online banking, e-commerce, or reservation systems.

To support analytical processing, where the focus is on aggregating and analyzing large amounts of data to derive insights and trends, a new type of database emerged, known as the data warehouse. These databases are optimized for OLAP (Online Analytical Processing), which involves complex and long-running queries that scan millions or billions of records. For example, business intelligence, data mining, or reporting. These OLAP systems and data warehouses are usually built on top of relational databases, but they use a different data model, known as a star schema or snowflake schema, to facilitate analytical queries. They also employ various techniques, such as indexing, partitioning, compression, and caching, to improve performance and scalability.
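As an illustration only, here is a minimal star-schema sketch using SQLite (the table and column names are made up for the example): a central fact table joined to two dimension tables and aggregated the way an OLAP query would be.

```python
import sqlite3

# A toy star schema: one fact table surrounded by dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE fact_sales  (product_id INTEGER, date_id INTEGER, amount REAL);

INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
INSERT INTO dim_date    VALUES (1, 2024, 1), (2, 2024, 2);
INSERT INTO fact_sales  VALUES (1, 1, 20.0), (1, 2, 35.0), (2, 2, 60.0);
""")

# An OLAP-style query: join the fact table to its dimensions and
# aggregate across many rows (here, revenue by category and month).
rows = conn.execute("""
SELECT p.category, d.year, d.month, SUM(f.amount) AS revenue
FROM fact_sales f
JOIN dim_product p ON p.product_id = f.product_id
JOIN dim_date    d ON d.date_id    = f.date_id
GROUP BY p.category, d.year, d.month
ORDER BY p.category, d.year, d.month
""").fetchall()

for row in rows:
    print(row)
```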

Evolution of Databases — Scale

When applications started to move to the cloud, the scale of data being generated and stored grew exponentially across organizations. Businesses wanted to store all of this data, and they also wanted to run analytics to drive decisions based on insights. This gave birth to, yes, you probably guessed it: Hadoop, and eventually data warehouses in the cloud. This category now includes Snowflake, BigQuery (from Google), and Redshift (from AWS).

Eventually, the databases that solved for speed could not keep up with scale, and those that optimized for scale came at the expense of speed, forcing businesses to deploy multiple databases.

One of the main challenges of using different kinds of databases for different purposes is the need to integrate and synchronize data across them. The ferrying of data across multiple systems gave birth to an industry around data integration and Extract, Transform, and Load (ETL). Needless to say, all of this is costly, complex, and error-prone, especially when dealing with large and heterogeneous data sets. Furthermore, it can create inconsistency and latency issues, as the data in one database may not reflect the latest changes in another.

To address this challenge, some databases have attempted to combine the best features of both worlds and offer a hybrid solution that supports both OLTP and OLAP workloads. These are known as HTAP (Hybrid Transactional/Analytical Processing) databases, which enable real-time analytics on operational data without compromising performance or scalability. Examples include SAP HANA (entirely memory-based) and SingleStore (formerly MemSQL), which managed to add both speed and scale.
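The HTAP idea, reduced to a sketch: the same store serves a short transactional write and an analytical aggregate with no ETL hop in between. SQLite stands in purely for illustration here; real HTAP engines pair row storage for writes with columnar storage for scans.

```python
import sqlite3

# One store, two workloads: transactional writes and analytical reads.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)")

# OLTP-style: short, frequent transactions touching a few rows.
with conn:
    conn.execute("INSERT INTO orders VALUES (?, ?, ?)", (1, "EU", 120.0))
    conn.execute("INSERT INTO orders VALUES (?, ?, ?)", (2, "US", 80.0))

# OLAP-style: an aggregate over the same, freshly written data, no ETL needed.
for region, total in conn.execute("SELECT region, SUM(amount) FROM orders GROUP BY region"):
    print(region, total)
```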

Having evolved along the axes of speed and scale, databases are now in a third phase, in which they need to morph yet again thanks to generative AI.

What does generative AI need from databases?

The evolution of Database 1.0 was driven by the needs of a changing application landscape, and that landscape is now going through a tectonic shift.

However, this time around, it is not just about speed and scale, it is about something far more natural. Something that was called out in the Codd paper prophetically more than half a century ago.

When it comes to generative AI, we are seeing applications fall under two broad categories: information retrieval and agentic. The former uses Large Language Models (LLMs) to synthesize knowledge; the latter uses tools to take that knowledge and context and execute actions.

Both use cases, it turns out, require knowledge and context.

The combination of knowledge and context leads to choices and actions. Given that every application has access to the same LLMs and similar tools (APIs and functions), what differentiates a GenAI application is the knowledge, i.e., the data, and therein lie the requirements for Database 2.0.

The ICC Framework

First, let's remember that for the first time, applications, specifically GenAI applications, are generating data at an unimaginable rate. This brings us back to the notion of speed.

Second, they generate data in the form of images, text, audio, and video (multiple data types), but only after contextualizing the existing data for the LLM. This brings in scale and complexity.

The fairly established way of contextualizing data for LLMs today is a methodology called Retrieval Augmented Generation, or RAG for short.

Since AI is now moving toward real-time interaction with audio and video interfaces, RAG not only has to contextualize data across multiple data types at petabyte scale, it has to do so in a few milliseconds.
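For readers who have not seen RAG up close, here is a minimal sketch of the loop: retrieve the most relevant documents for a question, then hand them to the model as context. A toy bag-of-words "embedding" stands in for a real embedding model, and the final LLM call is left as a prompt string; everything here is illustrative, not any vendor's API.

```python
from collections import Counter
import math

# Toy "embedding": bag-of-words counts. A real system would call an
# embedding model here; this stand-in keeps the sketch runnable.
def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# The "knowledge": documents that live in the database, pre-embedded.
documents = [
    "SingleStore stores data in both rowstore and columnstore formats.",
    "Redis is an in-memory key-value store often used as a cache.",
    "Star schemas organize warehouses around fact and dimension tables.",
]
index = [(doc, embed(doc)) for doc in documents]

def retrieve(question: str, k: int = 2) -> list[str]:
    # Rank stored documents by similarity to the question and keep the top k.
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

def build_prompt(question: str) -> str:
    # In a real app this prompt is sent to an LLM; here we just return it.
    context = "\n".join(retrieve(question))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

print(build_prompt("How does SingleStore store data?"))
```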

But here is the thing: now, more than ever, generative AI has shown us that along with speed and scale, databases also need to be simple.

In other words, we need to conquer the complexity of multiple data types and cut through ETL, cataloging, projections, pipeline jobs, and so on, because now, more than ever, GenAI needs data in real time in order to make the right choices and execute the right actions.

Anything short of that leads not only to bad or inaccurate answers but to downright harmful and dangerous downstream actions.

How do we get to Database 2.0?

It turns out that in year one of GenAI, it was all about bringing the data to the LLMs (through RAG). In year two and beyond, we also need to bring the LLMs into the database.

If we break down the requirements for GenAI apps, they need a knowledge box that can ingest any data (multiple data types) in real time, put it in the right places, process it (for example, vectorize it), analyze it (run complex analytics and aggregation functions), and then contextualize it with the existing knowledge (also spanning multiple data types). It needs to do all of this in a few milliseconds across petabytes of data.

In short, Database 2.0 should have “intelligent” features like transactional and analytical storage and functions, SQL and NoSQL capabilities, and hybrid search (vector plus exact keyword match), and it should manage all of this with speed, at scale, and with simplicity: for example, the ability to talk to all of this knowledge in natural language as well as through existing interfaces like SQL and APIs.
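Here is a rough, illustrative sketch of what hybrid search means: an exact keyword filter narrows the candidates, and a vector-similarity score ranks them. The toy bag-of-words vectors stand in for real embeddings; in a database, this would be a WHERE clause plus a vector index, but the two-step idea is the same.

```python
from collections import Counter
import math

# Toy vector: bag-of-words counts (a stand-in for real embeddings).
def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

documents = [
    "Invoice 1042: payment received for the annual enterprise plan.",
    "Invoice 1043: payment overdue, reminder sent to the customer.",
    "Blog post: why hybrid search combines keywords with vectors.",
]

def hybrid_search(query: str, keyword: str, k: int = 2) -> list[str]:
    # Exact keyword match narrows the candidates (like a WHERE clause)...
    candidates = [d for d in documents if keyword.lower() in d.lower()]
    # ...then vector similarity ranks them by semantic closeness.
    q = embed(query)
    return sorted(candidates, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

print(hybrid_search("which customer has not paid yet", keyword="invoice"))
```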

The three tenets of Database 2.0

Only a couple of databases can do this today. SAP HANA is one of them; it recently added vector search capabilities, but given its in-memory pedigree it is expensive, to say the least.

SingleStore (formerly MemSQL) is another data platform; it uses patented technology to store data in both row-based relational and columnar formats, which gives it transactional and analytical capabilities, and it has customers running workloads at petabyte scale. In addition, it has pipelines to bring in all kinds of data (all data types), including real-time streaming data that can be processed and vectorized on the fly so that data can be contextualized and sent to LLMs in a few milliseconds. This process is called real-time, or Live, RAG. It also recently added an intelligent compute layer and support for both vector and exact keyword match.

To conclude, the next generation of databases should be able to handle diverse and complex data types and queries while providing fast, scalable, and simple solutions. It should do this with an almost effortless simplicity, like the Prime Radiant famously depicted in Isaac Asimov's Foundation series.

GenAI can now finally make Database 2.0 a reality. Only when we achieve simplicity along with the other two key attributes of speed and scale will we see GenAI apps evolve into useful agents, assistants, and eventually Artificial General Intelligence, or AGI.
