TileDB 2.0 and the Future of Data Science
--
The Helicopter View
TileDB is an open-source and cloud-native storage engine for chunked, compressed, multi-dimensional arrays. It introduces a universal data format, general enough for all application domains, and with built-in data versioning. It offers many APIs and data science tool integrations. Today we announce TileDB 2.0, a huge milestone for our community and company.
For those already familiar with TileDB, the new features in version 2.0 are:
- Support for heterogeneous dimensions in sparse arrays
- Support for string dimensions in sparse arrays
- Google Cloud Storage and Azure Blob Storage support
- Completely revamped R API
These features mark an important step in our journey to deliver a universal data storage engine for all applications and data science tools, and to shape the future of Data Science.
TileDB 2.0 Now Handles Dataframes
Time to move on from legacy columnar file formats.
TileDB is an array storage engine that offers super-efficient multi-dimensional (dense and sparse) array slicing. But can it handle dataframes?
It is fairly easy to see how a dataframe is equivalent to a sparse array. Choose any subset of dataframe columns (the ones selected for fast slicing) and define them as the “dimensions”. Then your dataframe records become points in a sparse multi-dimensional space, i.e., a sparse array. Therefore this is an ideal case for TileDB.
Until now, TileDB supported only homogeneous dimensions, i.e., all dimensions of an array had to share the same data type. This worked well for data like LiDAR (3D points with Double coordinates). However, we realized it was limiting for dataframes, where one may wish to slice on columns with different types, such as Date (Datetime) and Price (Double). Moreover, many dataframes also have String columns that users need to slice on (e.g., Name or StockSymbol). TileDB 2.0 adds heterogeneous and string dimensions, and with them full support for dataframe use cases.
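To make the dataframe-as-sparse-array idea concrete, here is a minimal sketch in plain Python. It does not use the TileDB API; the column names, the toy values, and the `slice_cells` helper are all hypothetical. It picks a string column and a date column as heterogeneous dimensions, turns each record into a non-empty cell, and slices on both dimensions at once.

```python
from datetime import date

# Toy dataframe with made-up values: (symbol, day, price, volume).
rows = [
    ("AAA", date(2020, 1, 2), 10.0, 100),
    ("AAA", date(2020, 1, 3), 10.5, 120),
    ("BBB", date(2020, 1, 2), 20.0, 200),
    ("CCC", date(2020, 1, 3), 30.0, 300),
]

# Columns chosen for fast slicing become dimensions (a string and a date here);
# the remaining columns become attributes stored in each cell.
DIMS = ("symbol", "day")
ATTRS = ("price", "volume")

# Each record is one non-empty cell of a sparse 2D array: coordinates -> values.
cells = {(sym, day): {"price": p, "volume": v} for sym, day, p, v in rows}


def slice_cells(cells, symbol_range, day_range):
    """Return the cells whose coordinates fall inside both closed ranges."""
    (s_lo, s_hi), (d_lo, d_hi) = symbol_range, day_range
    return {
        coords: values
        for coords, values in sorted(cells.items())
        if s_lo <= coords[0] <= s_hi and d_lo <= coords[1] <= d_hi
    }


# Slice: symbols in ["AAA", "BBB"] on 2020-01-02 only.
result = slice_cells(cells, ("AAA", "BBB"), (date(2020, 1, 2), date(2020, 1, 2)))
for coords, values in result.items():
    print(coords, values)
```

The linear scan is only there to keep the sketch short; a storage engine would rely on its tiling and indexing to avoid touching cells outside the requested ranges.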
Data scientists have been using popular columnar dataframe formats for a while now. The importance of adding dataframe support to TileDB 2.0 stems from the other features that make TileDB a great storage engine and more than a flat file format. More specifically:
- TileDB supports data versioning natively, built into its format and storage engine. Other formats do not support data updates or time traveling; you need to build your own logic at the application layer or use extra software that acts like a database on top of your files.
- TileDB implements a variety of optimizations around parallel IO on cloud object stores and multi-threaded computations (such as sorting, compression, etc.).
- TileDB is also “columnar”, but it offers more efficient multi-column (i.e., multi-dimensional) search.
- With TileDB, you inherit a growing set of APIs (C, C++, Python, R, Java, Go), backend support (S3, GCS, Azure, HDFS), and integrations (e.g., Spark, MariaDB, PrestoDB, Dask), all developed and maintained by the TileDB team.
TileDB 2.0 is more than a columnar format; it is a full-fledged data management solution that evolves beyond monolithic databases, acting as an embeddable library that works with all your favorite data science tools.
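The versioning and time-traveling point above can be illustrated with a small sketch. This post does not describe TileDB's internal layout, so the following is an assumption-laden illustration of one common approach on immutable storage: every write is kept as a separate timestamped batch, and a read merges all batches up to a chosen timestamp. The `VersionedArray` class and its methods are hypothetical names, not the TileDB API.

```python
class VersionedArray:
    """Append-only store: each write is an immutable, timestamped batch."""

    def __init__(self):
        self._batches = []  # list of (timestamp, {coords: value})

    def write(self, timestamp, cells):
        # Never modify earlier batches; just record a new one.
        self._batches.append((timestamp, dict(cells)))

    def read(self, at=None):
        """Merge batches in timestamp order; later writes override earlier ones.

        If `at` is given, batches written after that timestamp are ignored,
        which yields the array as it looked at that point in time.
        """
        view = {}
        for ts, cells in sorted(self._batches, key=lambda batch: batch[0]):
            if at is not None and ts > at:
                break
            view.update(cells)
        return view


arr = VersionedArray()
arr.write(1, {(0, 0): "a", (0, 1): "b"})
arr.write(2, {(0, 1): "B"})   # update an existing cell
arr.write(3, {(1, 1): "c"})   # add a new cell

print(arr.read(at=1))  # the array as of timestamp 1
print(arr.read())      # the latest version
```

Because old batches are never rewritten, updates and time travel come from the same mechanism, with no extra logic at the application layer.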
A Universal Data Engine
Data scientists are at a crossroads. On one hand, there are multiple data formats across application domains and domain-specific software for parsing and analyzing the data. On the other hand, data scientists want to embrace a growing set of generic open-source data science tools for fast analysis, visualization and scalability. The challenge they face is that their data needs to be converted from a legacy format into “something” that the generic tools can understand, which is a long, cumbersome and inefficient process.
Take any popular data science tool. You will notice that it expects the data in one of two forms: either a dataframe (i.e., a table) or a multi-dimensional array (e.g., a feature vector, a 2D image, a 3D video, etc.). For example, all traditional relational databases and tools like Pandas and data.table operate on dataframes; NumPy, xtensor and all machine/deep learning tools based on linear algebra work with arrays.
What about key-value stores? A key-value store can be thought of as a dataframe with one or more “key” columns and one or more “value” columns, where fast searching (slicing) on the key columns is important.
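As a rough sketch of this view, the key column can be kept sorted so that both point lookups and key-range slices are fast. The `SortedKV` class below is a hypothetical illustration in plain Python, not part of TileDB.

```python
from bisect import bisect_left, bisect_right


class SortedKV:
    """A two-column 'dataframe' (key, value) kept sorted on the key column."""

    def __init__(self, items):
        pairs = sorted(items, key=lambda kv: kv[0])
        self._keys = [k for k, _ in pairs]
        self._values = [v for _, v in pairs]

    def get(self, key):
        # Binary search on the key column: a point lookup.
        i = bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        raise KeyError(key)

    def range(self, lo, hi):
        # Slice on the key column: all records with lo <= key <= hi.
        i = bisect_left(self._keys, lo)
        j = bisect_right(self._keys, hi)
        return list(zip(self._keys[i:j], self._values[i:j]))


kv = SortedKV({"b": 2, "a": 1, "d": 4}.items())
print(kv.get("b"))
print(kv.range("a", "c"))
```

A point lookup is simply a slice whose range contains a single key, which is why a key-value store fits the same model as a dataframe sliced on its key columns.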
Therefore, the vast majority of applications and tools work on dataframes and arrays. And as explained above, dataframes are also arrays. The addition of dataframe support coupled with a cloud-optimized format and built-in data versioning and metadata handling makes TileDB 2.0 the only storage engine delivering a universal data format in a single open-source, embeddable library.
Built for Data Scientists
Extreme interoperability and the new TileDB R API.
A storage engine, no matter how powerful, is not meaningful as a standalone library. Data scientists use analysis tools to make scientific discoveries fast and do not really care about files, formats, or storage backends. Also, every data scientist has her own preferences in programming languages and tools. To this end, we have put enormous effort into building and growing the set of TileDB APIs (C, C++, Python, R, Java, Go) and integrations (Spark, Dask, MariaDB, PrestoDB, GDAL, PDAL).
TileDB 2.0 brings you a completely revamped TileDB R API. We are fortunate that Dirk Eddelbuettel, the (co-)author / maintainer of several popular R packages and a board member of the R Foundation, joined us to pull this off. We want to make TileDB an integral part of the R ecosystem and are just getting started on integrations with other key R packages, such as the tidyverse and Bioconductor.
To the Clouds and Beyond
Managing truly cloud-ready data.
We see more and more of our users and customers moving to the Cloud for affordable storage and scalability. TileDB 2.0 adds support for Google Cloud Storage and Azure Blob Storage to the existing AWS S3 support.
TileDB abstracts IO for all the supported backends. Cloud object stores present extra challenges compared to traditional filesystems, such as eventual consistency, object immutability, added network latency and cost, and more. TileDB is optimized for such cloud object stores and takes care of all the details: efficiently updating your arrays and managing the underlying cloud store objects. Moreover, TileDB does this without compromising the performance on other backends, such as the filesystem on your laptop or a distributed filesystem like Lustre or HDFS.
Most importantly, as new filesystems emerge, TileDB's abstracted architecture can be extended to support them seamlessly, allowing you to keep using the same TileDB APIs and integrations.
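The idea of abstracting IO behind one interface can be sketched as follows. This is not TileDB's implementation or API; `StorageRouter`, `LocalBackend`, and `MemoryBackend` are hypothetical names, and the in-memory backend merely stands in for an object store. The point is that callers use the same `read`/`write` calls regardless of the URI scheme, and a new backend only needs to be registered.

```python
import os
import tempfile
from urllib.parse import urlparse


class LocalBackend:
    """Reads and writes files on the local filesystem."""

    def write(self, path, data):
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(data)

    def read(self, path):
        with open(path, "rb") as f:
            return f.read()


class MemoryBackend:
    """Stand-in for an object store: whole-object puts and gets only."""

    def __init__(self):
        self._objects = {}

    def write(self, path, data):
        self._objects[path] = bytes(data)

    def read(self, path):
        return self._objects[path]


class StorageRouter:
    """Dispatches reads and writes to a backend chosen by URI scheme."""

    def __init__(self):
        self._backends = {}

    def register(self, scheme, backend):
        self._backends[scheme] = backend

    def _resolve(self, uri):
        parsed = urlparse(uri)
        return self._backends[parsed.scheme], parsed.netloc + parsed.path

    def write(self, uri, data):
        backend, path = self._resolve(uri)
        backend.write(path, data)

    def read(self, uri):
        backend, path = self._resolve(uri)
        return backend.read(path)


storage = StorageRouter()
storage.register("file", LocalBackend())
storage.register("mem", MemoryBackend())  # a new backend is one registration away

local_uri = "file://" + os.path.join(tempfile.mkdtemp(), "array", "data.bin")
for uri in (local_uri, "mem://bucket/array/data.bin"):
    storage.write(uri, b"tile bytes")
    print(uri.split("://")[0], storage.read(uri))
```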
The Future
A full circle? Not really.
Monolithic Databases in the ’70s, legacy file formats in 2020. Are we going back to the ’70s by building something more than a flat file format? Not at all! With TileDB we are bridging the best of two worlds.
We all understand that storage and compute must be separate, so that they can scale independently. And storage must be open-source and interoperable. Also, it is clear today that compute must be more than just SQL. TileDB is a full-fledged data engine that deals with “all things storage”: compression, slicing, efficient IO, versioning, metadata. These remain constant factors whether you use a relational database or a tool like Spark, Dask, etc. So we decided to build this storage engine once, super-efficiently, and for all to use.
The next challenge is compute. Data scientists have different preferences in languages and tools, so we chose to integrate with everything. You can perform SQL with MariaDB, PrestoDB, and Spark, or run distributed linear algebra with Dask, all on the same TileDB array, avoiding multiple data conversions and copies.
Our ongoing work takes this further: we are constantly identifying common operations across languages and tools and pushing them down to TileDB. For example, a filter operation is common across MariaDB, Spark, Dask, and NumPy; we realize that there is no need to implement it multiple times. We are working on bringing such operations closer to the storage for improved performance.
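To illustrate why pushing a filter down to storage can help, here is a minimal sketch. The post does not say how TileDB will implement this, so the approach below (skipping whole tiles using per-tile min/max statistics) is an assumption, and `TiledColumn` is a hypothetical class. It compares filtering in the client after reading everything with filtering inside the storage layer.

```python
class TiledColumn:
    """A column stored in fixed-size tiles with per-tile min/max statistics."""

    def __init__(self, values, tile_size):
        self.tiles = [
            values[i:i + tile_size] for i in range(0, len(values), tile_size)
        ]
        self.stats = [(min(t), max(t)) for t in self.tiles]
        self.tiles_read = 0  # counts how many tiles were fetched

    def read_all(self):
        """No push-down: fetch every tile and let the caller filter."""
        self.tiles_read += len(self.tiles)
        return [v for tile in self.tiles for v in tile]

    def read_range(self, lo, hi):
        """Push-down: skip tiles whose [min, max] cannot match lo <= v <= hi."""
        out = []
        for tile, (t_min, t_max) in zip(self.tiles, self.stats):
            if t_max < lo or t_min > hi:
                continue  # this tile is never fetched
            self.tiles_read += 1
            out.extend(v for v in tile if lo <= v <= hi)
        return out


# Filter applied by the tool after reading everything.
client_side = TiledColumn(list(range(100)), tile_size=10)
filtered_in_client = [v for v in client_side.read_all() if 42 <= v <= 57]

# The same filter applied inside the storage layer.
pushed_down = TiledColumn(list(range(100)), tile_size=10)
filtered_in_storage = pushed_down.read_range(42, 57)

print(client_side.tiles_read, pushed_down.tiles_read)
```

Both paths return the same values, but the pushed-down version fetches far fewer tiles, and the filter is written once instead of once per tool.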
The future of data science is here. Our vision, as both software developers and collaborators with the data science community, is to minimize the hours wasted on data wrangling. Superior performance and ease of use start with multi-dimensional arrays and common APIs built around arrays. TileDB 2.0 is a critical milestone. Check us out on GitHub and join us on this journey.