The Top 7 Databases for Machine Learning

Adam Carrigan
Jan 19, 2022


One of the most common questions I get asked is, ‘What is the best database for Machine Learning?’ In reality, I nearly always answer ‘It depends’ and then bombard the enquirer with a series of follow-up questions. But because ‘It depends’ is never fun in blog form, I have put together this list.

Machine Learning has now penetrated every aspect of our lives, whether you realise it or not: from the video recommendations you see on YouTube and the systems keeping you safe while you bank or shop online, all the way to the processing of images on your smartphone. And still, only a relatively small number of companies outside behemoths like Google, Chase and Apple have implemented Machine Learning in any significant way.

This is about to change. With the plethora of new ways to do Machine Learning, and with the cost of implementation and maintenance having dropped significantly over the past few years, smaller businesses will soon be implementing Machine Learning on the data they have already been collecting (in many cases for years).

The key ingredient of Machine Learning is, of course, data, and the vast majority of usable data is stored inside databases. The first databases appeared in the 1960s, and they have come a long way since their rudimentary past. However, they were never conceived or built with Machine Learning in mind.

Thus, not all databases were created equal.

Whether you want to do Machine Learning right away or just want to future-proof yourself, the databases below are excellent for Machine Learning purposes.

QuestDB:

Time Series is eating the (Data) World. You may have heard the statistic that more data will be created this year than in all previous years combined, but what you might not know is that an increasing proportion of that data is Time Series. In fact, the number one business use-case for Machine Learning is, you guessed it, Time Series. This is one of the reasons Nicolas and Vlad over at QuestDB created a new database.

SingleStore:

Let’s be honest, Machine Learning requires lots of data, and depending on the problem it can be crazy volumes of data. For example, OpenAI’s famed language model GPT-3 has around 175B parameters and was trained on text drawn from some 45TB of raw data. For this kind of data you need a seriously fast and scalable database. Enter SingleStore, which is, as their marketing team puts it, ‘The Database for the Data-Intensive Era’. Some of the most data-intensive companies around use SingleStore, including Uber, Cisco and Hulu.

ClickHouse:

ClickHouse, by the famed Russian search engine Yandex, is taking the world by storm. An offshoot just recently raised $250m from illustrious VCs Index and Lightspeed. It has also spawned amazing companies like Altinity.

Believe it or not, Spotify is one of the leading employers of Machine Learning engineers in Europe. Ever wondered why their song recommendations are so good? It’s thanks to the hard-working folks in Spotify’s ML team. And you guessed it: Spotify uses ClickHouse!

MindsDB:

While technically not a database (although it looks and acts like one), MindsDB enables you to add Machine Learning capabilities to any database. MindsDB works with every database on this list, in addition to all the well-known ones like MySQL and Postgres. It also works with the likes of Snowflake and Redis, and it’s open-source! MindsDB is backed by Y Combinator and the founders of MySQL and MariaDB.

DataStax & Cassandra:

The Apache Cassandra-based DataStax is going head-to-head with Kafka on streaming and doing pretty well. Doing Machine Learning on streams is reasonably new, but its potential to change the way businesses use data is phenomenal. The folks over at DataStax have ML squarely in their sights, with many of their largest customers implementing ML from the ground up.

MariaDB:

From the team who brought you MySQL comes MariaDB. At the recent MariaDB Server Fest, an increasing number of speakers focused heavily on Machine Learning, and it is a big priority for both the MariaDB Foundation and those who are building SkySQL, their cloud offering.

Little-known fact: both MySQL and MariaDB are named after the founder’s children (My and Maria). There is also MaxDB, named after, you guessed it, his kid Max. Love your work, Monty.

MongoDB:

MongoDB is a fan favourite among developers and startups, and has grown from a niche player to the behemoth it is today. While they are now heavily focused on their cloud offering, MongoDB Atlas, they have not forgotten their open-source roots, building one of the best NoSQL databases around. You probably don’t associate NoSQL with Machine Learning, but the team over at MongoDB know its importance and work closely with their customers on many ML use-cases.

This list is by no means comprehensive, and I’m sure there will be much debate in the comments. Feel free to follow me on Twitter.


Adam Carrigan

Co-Founder at MindsDB, Y Combinator & UC Berkeley SkyDeck Alum, former management consultant at Deloitte, University of Cambridge grad!