Why Can’t We Have AI-Driven Database Design?

Paige Roberts
11 min read · Feb 21, 2023


[Image: Two mountain climbers at a mountain's peak, all alone, with the sun shining on them.]

One of the most time-consuming, expensive, expertise-intensive aspects of getting a new analytics database up and running is data modeling and overall database design. There's this huge, months-long project that involves interviewing all the folks who want to do analytics — trying to understand their requirements, and then understand the data in relation to those requirements. How much data do they need to analyze, how fast is that data coming in, what kind of analyses do they run on it? Will they need to power a dashboard, run an application, analyze locations, deal with the avalanche of IoT data? Usage patterns? Constraints? Primary keys? Naming conventions? Indexes?

So many questions, and so much time and effort by smart people to determine how a database’s tables, indexes, etc. should be set up.

A few companies are pioneering a better way.

AI-Driven Data Modeling

You'll be forgiven if you didn't know that a few databases can create their own data model, bespoke and custom-fitted. There are only two of them: Vertica and StarTree.ai, the commercial version of the open source database Apache Pinot. In StarTree's case, this capability to use a smart application to design the database is only in the commercial version, and it's relatively young tech. Vertica has it in all versions, including the free version, and has offered it for about 10 years. In many ways, they're very similar databases, with the only obvious difference being the indexing. More about that in a minute.

Recently, I was frustrated when an industry analyst dinged Vertica, where I work, as having no custom data modeling because it didn't come with out-of-the-box, pre-made data models for certain industries like finance or healthcare. That's a lot like a suit designer being dinged for not having blue and brown suits in standard sizes that you can get off the rack. In the meantime, our customers are all wearing their perfectly tailored suits in checkers and plaids and pinstripes that change color based on their mood and the time of day. I wonder if StarTree gets dinged for the same reason.

Let me discuss indexes for a minute, then I’ll dive into how AI-driven database design works in the two databases.

Indexes

Indexes have never been a great way to improve query response. I know, that’s a hugely controversial statement in the database world that might get me chased by an angry mob with pitchforks and laptops.

It's like some engineer back in the day thought: My data is all over the place, and I have to do full table scans all the time to find it. That's too slow, but rather than figure out a better way, the engineer settled for a stopgap that stores pointers to the data. It's got some drawbacks over the long term, with indexes growing huge and having to be updated and maintained all the time, but it will do for now.

And it stuck.

That does seem to be the way with any code created to be a short-term fix.

Vertica has always been a bit of a pioneer: the first to do columnar data storage since Sybase IQ in the '80s, the first to say no to leader/master nodes (a.k.a. bottlenecks) on clusters, the first to do cloud-style object storage and elastic compute on-prem, the first to do geospatial analysis, time series analysis, and machine learning all in a single database, etc. Pinot and lots of other columnar databases and data storage formats owe a lot to that pioneering.

But being first has a drawback: there's nothing to compare the results of your pioneering efforts to. Folks aren't familiar with this newfangled, unique thing you do, so they're pretty sure it can't be as good as what everyone else is doing. Vertica has no indexes at all, for instance, and it's the only database I know of that's built that way. We get dinged on that, too, by the analysts. No indexes, check. Never mind that we use a different strategy that optimizes the data itself, one that has repeatedly proven higher performance in head-to-head benchmarks and POCs. It's like getting dinged for not having fuel-saving measures when yours is an electric car.

Vertica doesn’t check the standard boxes, but that doesn’t mean it’s deficient.

The Vertica data model is the key to avoiding all the indexing overhead. No indexes or pointers are needed because Vertica knows where all the data is stored and can find it fast without first scanning a pointer list, then using that to go find the data. And because Vertica doesn’t store data like any other database, Vertica engineers realized early on that no DBAs out there are experts in the Vertica style of data modeling. DBAs don’t learn about storing data in projections when they get certified. They learn about indexes. And we don’t have those.
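To make the contrast concrete, here's a toy Python sketch (my own illustration, not Vertica internals): a traditional index is a separate pointer structure that has to be maintained alongside the data, while a projection-style store keeps the rows themselves sorted, so a binary search finds them directly with no pointer list.

```python
import bisect

# A traditional index: a separate pointer structure mapping key -> row
# positions, which has to be rebuilt or updated whenever rows change.
rows = [("melon", 5), ("apple", 3), ("pear", 7), ("apple", 9)]
index = {}
for pos, (key, _) in enumerate(rows):
    index.setdefault(key, []).append(pos)

apple_rows = [rows[p] for p in index["apple"]]  # indirect: index first, then data

# A projection-style store: the rows themselves are kept sorted on the
# query key, so a binary search over the data finds matches directly.
projection = sorted(rows)
lo = bisect.bisect_left(projection, ("apple",))
hi = bisect.bisect_right(projection, ("apple", float("inf")))
apple_rows_sorted = projection[lo:hi]  # direct: no pointer list at all
```

Both lookups return the same rows; the difference is that the second one has no side structure to grow stale or bloat.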

StarTree also figured old school indexes were not a great idea, but they used a different strategy: They optimized the indexes. Pinot indexes are not just a single list of pointers, but multiple lists, each optimized for a different type of analysis, so the query optimizer can choose the best pointer list from several options.

So, since no one else does data modeling like Vertica does, or optimizes indexes like StarTree does, we couldn't expect early adopters to come to their database administrator jobs with much knowledge of Vertica-style data modeling or StarTree-style indexing.

That’s why, years ago, Vertica’s creators put the intelligence into the application, and now StarTree is doing the same thing.

Vertica Database Designer

Database Designer (DBD) is a smart little application that comes with every Vertica license. Using your existing data and historical analytic queries, it creates a bunch of possible data models.

[Diagram: Sample data from many tables and multiple queries feed into Database Designer, which produces many different subsets and supersets of the data, sorted different ways, as possible data configurations, a.k.a. multiple projection designs.]

After it’s auto-generated a lot of different possible models for the data, Database Designer uses AI and experimentation to choose the ones that are ideally suited to your priorities. It optimizes how the data is stored and sorted to give the fastest query performance, even if that means storing the data multiple times in multiple configurations.

Most data models optimize data storage efficiency over query response speed. Since Vertica does the opposite, most folks assume that means the data in Vertica is hugely bloated. In fact, it usually takes up half the space that other analytical databases use — sometimes as little as 10% of the space. That’s because the AI chooses between more than a dozen different data compression algorithms to compress and encode each column for optimum storage savings.
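As a rough illustration (my own toy heuristic, not Vertica's actual selection logic), a per-column encoding choice might look something like this in Python:

```python
def choose_encoding(column):
    """Toy heuristic for picking a per-column encoding. Illustrative only:
    Vertica's designer actually tests real encodings against real data."""
    distinct = len(set(column))
    if distinct <= len(column) // 4:
        return "RLE"    # few distinct values: run-length encode the runs
    if all(isinstance(v, (int, float)) for v in column) and list(column) == sorted(column):
        return "DELTA"  # sorted numbers: store small deltas instead of full values
    return "NONE"       # nothing obvious: store as-is

# A low-cardinality column and a sorted timestamp column get different picks.
region_col = ["US", "US", "US", "US", "EU", "EU", "EU", "EU"]
ts_col = [1000, 1001, 1003, 1010]
```

Here `choose_encoding(region_col)` picks run-length encoding and `choose_encoding(ts_col)` picks delta encoding; the real designer experiments across its full library of encodings per column.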

[Diagram: Experimentation in the Query Optimization phase chooses the best configuration for the data. That leads to the Storage Optimization phase, where different column encodings are tried to find the most efficient storage mechanism for each. The result is a design script or a deploy script.]

When the smart application has figured out the best design for the data storage to optimize query response, and the best way to compress data to optimize storage costs, it generates the DDL to create that data model. If you want to tweak the DDL, you can. If not, the ideal, bespoke database will be created for you.

Here’s how:

1. Fire up the application wizard and name your new data model.

2. Make some decisions about what you want to accomplish with your database:

a. Design Type: Design a database from scratch, or tweak an existing database.

[Wizard screen with two selections under Design Type: Comprehensive — Database Designer creates new projections for all selected design tasks; optionally, you can supply queries to be analyzed for further optimization. Incremental — Database Designer creates projections for all tables referenced in queries that you supply.]

b. Optimization Objective: Optimize the database design for high-speed load, high speed analysis, or balance the two.

[Wizard screen: a balance continuum with "Minimize storage footprint by creating fewer projections" on one end and "Maximize query performance by creating more projections" on the other. A green arrow indicates the center balance point.]

c. Schemas: Let it know if you want everything from your old database switched over to the new design, or if not, which schemas you want.

3. Choose some options:

a. K-Safety: Decide what level of redundancy you want built into the data to keep it safe.

b. Propose Unsegmented Projections: Vertica automatically balances partitions and shards as it segments data across nodes. If you allow it, Database Designer will also choose which tables are small enough to be replicated across nodes for highest performance.

c. Analyze Correlations Mode: This will check correlated features, and take them into account when designing the data model.

[Wizard screen titled Options: select design options below. The three options are the same as explained in the text above.]

4. Show DDL or Create DB: This determines whether you want to look at the design Database Designer comes up with, and maybe tweak it by hand, or you just want it to create the database for you.

5. Sample Queries: Give it your query log, or a file full of queries showing what kind of work you need the database to do.

6. Let Database Designer do its thing.
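Conceptually, the search the wizard kicks off can be sketched like this (a toy Python model of my own, not Vertica's real cost model): generate candidate sort orders, score each one against the query log, keep the winner.

```python
from itertools import permutations

def candidate_designs(columns):
    """Enumerate possible sort orders (candidate projections) for a table."""
    return list(permutations(columns))

def estimated_cost(design, query_columns):
    """Toy cost model: a query is cheap when the columns it filters or
    groups on come first in the projection's sort order."""
    return sum(design.index(col) for col in query_columns)

def pick_design(columns, query_log):
    """Score every candidate against the whole query log; keep the best."""
    return min(
        candidate_designs(columns),
        key=lambda design: sum(estimated_cost(design, q) for q in query_log),
    )

# Each query log entry lists the columns that query filters/groups on.
query_log = [["region"], ["region", "sale_date"], ["region"]]
best = pick_design(("region", "sale_date", "amount"), query_log)
```

Because most of the sample queries filter on `region`, the winning design sorts on `region` first; the real DBD also weighs load speed, storage footprint, and K-safety, and can keep several winning projections instead of just one.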

[Diagram: Database Designer (DBD) leverages typical queries, sample data/schemas, and historical statistics and logs. It optimizes query performance, data loading, and storage footprint. Benefits: faster queries, lower hardware costs, shortened design time, lower costs to maintain and optimize.]

Done.

DBD creates the ideal model according to your desires, your data, and your analytics workloads. We’re not talking about a database that fits your industry. Not even a database that fits your particular business, but a bespoke, perfectly fitted database for your particular data, your particular workload, at that particular point in time.

No months-long project, no hiring a DBA with specialized knowledge of how to configure this thing. You can learn the basics of how Vertica and Database Designer work for free, on demand.

StarTree Data Manager

The StarTree Data Manager is an application similar to Vertica Database Designer in that it uses smart code to help design the data model of the commercial version of the open source Pinot database. Now, I’m not going to be as good at explaining it because I don’t work for StarTree, and it’s only been around for about a year. Returning to my previous statement about being a pioneer: the StarTree Data Manager greatly benefits Vertica because now there is something to compare Database Designer to.

I’m mainly getting my information from the blog post StarTree did, “Announcing StarTree Data Manager” and from their website in general. Gwen Shapira plugged Pinot as an interesting database to watch at a recent Data Day Texas, where I was speaking about data architecture principles and she was talking about the things that drove her crazy about databases. That’s what put StarTree on my radar.

From what I can tell, StarTree bases streaming data table configurations on the streaming ingestion mechanism's message schema (Kafka, Pulsar, Kinesis, etc.). That's a thing you can do in Vertica, too, but it's more code-driven, with an autoparser involved, unlike Vertica DBD, which is fully AI-automated. It makes sense to include it in the AI-driven design tool. They talk about being able to handle batch, SQL, and file data as well, but that isn't tackled in the explanation of Data Manager, so I'll focus on streaming data.

Basing table schema on streaming message metadata isn't the way most databases do data modeling, so they built an application with some intelligence to make it easier. There's a nice little animated GIF on the blog post that goes through the wizard for creating tables. I'm not going to reproduce the images here because they're not mine to use.

One smart thing that multiple databases, including Vertica and Pinot, do that I think is extra cool is inferring structure from a semi-structured format like JSON. The database reads the message, figures out what the structure is likely to be, and provides a table structure that matches.

StarTree Data Manager does that. You hook to a streaming source, and get a table definition. You can then adjust data types and review what the application thinks is the structure, and approve it, or tweak it a bit, then approve it.
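A toy Python sketch of that inference step (my own illustration, not StarTree's actual code) might look like:

```python
import json

def infer_schema(message):
    """Infer a flat column-to-type mapping from one JSON message.
    (Toy sketch: real tools sample many messages and handle nesting.)"""
    type_names = {str: "VARCHAR", bool: "BOOLEAN", int: "BIGINT", float: "DOUBLE"}
    record = json.loads(message)
    return {field: type_names.get(type(value), "JSON")
            for field, value in record.items()}

# One message off the stream becomes a proposed table definition.
message = '{"user_id": 42, "event": "click", "value": 1.5, "mobile": true}'
schema = infer_schema(message)
```

The proposed schema is exactly what you'd then review in the wizard: adjust a data type here or there, approve it, done.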

That's where, in my opinion, the best part of Data Manager comes in: you can derive fields while you're defining the table structure. Very neat. There's a bit more manual work, like defining your primary keys by hand, that Vertica DBD does for you, but the next thing Data Manager does is unique.

[Image: Two mountain climbers at the peak of a mountain, sun shining behind them, reaching down a hand to help a third climber up.]

It lets you create multiple indexes with each different index sorting the pointers a different way. Like Vertica storing the data in multiple ways to make query response faster, Pinot stores the pointers to the data in multiple ways to make query response faster.
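A toy Python sketch of the idea (my illustration, not Pinot internals): the same rows get several pointer lists, each ordered a different way, and the query picks whichever list matches its access pattern.

```python
rows = [
    {"ts": 3, "user": "bo"},
    {"ts": 1, "user": "al"},
    {"ts": 2, "user": "cy"},
]

# Each "index" is just the row positions, sorted a different way.
indexes = {
    "by_ts":   sorted(range(len(rows)), key=lambda i: rows[i]["ts"]),
    "by_user": sorted(range(len(rows)), key=lambda i: rows[i]["user"]),
}

def scan(sort_key):
    """Pick the pointer list that matches the query's access pattern."""
    return [rows[i] for i in indexes["by_" + sort_key]]

oldest_first = scan("ts")  # walks the ts-ordered pointer list
```

One copy of the data, multiple cheap pointer orderings over it — the mirror image of Vertica keeping multiple orderings of the data itself.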

Vertica saw the problem with indexes and altered how data is stored so you don't need them. StarTree saw the problem with indexes and improved on the concept. I think it's telling that two different companies found two different strategies to solve a problem that most folks, including the industry analysts, still seem to think is an advantage.

Why can’t databases use AI to self-tweak their data models as data and analytics workloads shift?

Data models are not static. Database designs have to shift as data or workloads change. So, doing this once isn’t enough.

At Data Day Texas, during the brilliant Gwen Shapira’s talk about the things databases didn’t do that frustrated her on every project, she mentioned this as one of her pet peeves. In an ideal world, Shapira commented, databases would: “Tune workload intelligently. Know and analyze its own workload. And modify the index and other data structures to tune automatically.”

[Slide: "Intelligent and Adaptive" — blue text on a gray background. Photo Paige took of Gwen's slide while live-tweeting her talk at Data Day.]

My first thought was: well, there is one database that does that already: Vertica. It's the same smart app, Database Designer, and it works the same way. You just choose the "Incremental" selection in the wizard. The smart app evaluates the changes needed to make everything perform well again and shows you the DDL you need, or just deploys it if you're cool with that. You get a bespoke database optimized for the workload and data you have at that time.

The one thing it doesn't currently do is monitor usage and performance and tell you that you need to tweak the data model to adapt to data drift and workload shifts. Most folks usually notice when performance slips or SLAs start getting missed, though. And Vertica does provide all the information you need to make that determination yourself before it overly impacts the user base. I'm thinking we should add that to the roadmap.

I don’t know if StarTree has that capability, but if not, I bet it’s on their roadmap. They seem to be on top of the problems that a lot of other folks missed, the ones that frustrate experienced data management people on every project they do.

It’s hard to explain all this to analysts who wonder why this analytical software doesn’t just check the boxes for indexes and pre-made industry data models like everyone else. I have a lot of respect for the folks making their living analyzing dozens of technologies across an entire industry, but when you’re trying to understand many different techs, it’s easy to make a mistake. Finding a technology that’s different isn’t a sign that it is inferior.

Pioneers blaze trails. Everyone else, eventually, follows.

[Image: Two mountain climbers at the peak of a mountain, sun setting behind them, cheering on three other climbers as they help each other to the top.]


Paige Roberts

27 yrs in data mgmt: engineer, trainer, PM, PMM, consultant. Co-author of O'Reilly's "Accelerate Machine Learning" and "97 Things Every Data Engineer Should Know."