TileDB is a new database (DB) built to help Data Science teams make faster discoveries by giving them a more powerful way to store, update, analyze, and share large sets of diverse data. TileDB consists of a new multi-dimensional array data format, a fast, embeddable, open-source C++ storage engine with numerous Data Science tooling integrations, and a cloud service for easy data management and serverless computations, all architected to solve common pain points faced by data scientists. In this article, we describe the value TileDB brings to Data Science and answer the question of whether Data Science really needs a new DB.
Need for a New Data Format
There exist numerous data formats in a variety of application domains. So why does Data Science need a new format? We observe three main problems with current data storage approaches.
- Not cloud-optimized. Data scientists typically form large data lakes in cloud object stores, like AWS S3, due to their scalability and low cost. It is also common to separate storage from compute in order to scale them independently, optimize cloud usage, and reduce costs. Cloud object stores pose new challenges such as object immutability, eventual consistency, and IO request limiting. These challenges render most formats impractical for use in the cloud, as they were not originally designed with cloud storage in mind.
- Lack of support for data updates. Many existing formats lack support for efficient updates. For example, updating a Parquet file requires the creation of a new file, pushing the entire update logic to the user’s higher-level application. The same problem is encountered in genomics with population gVCF files that need to be reconstructed from scratch with every new sample insertion (often called the “N+1 problem”). This issue manifests whenever the update logic is not built into the format and storage engine, but is instead delegated to higher-level applications.
- Limited scope. Serious analytics and scientific software compute on either multi-dimensional arrays (e.g., for linear algebra, the backbone of machine/deep learning) or dataframes (e.g., for OLAP operations). We have demonstrated that a dataframe, in its full generality, is equivalent to a sparse array. Everything can be modeled as a dense or sparse array: geospatial data as 2D or 3D arrays, video as 3D arrays, population genomics as 2D arrays (one dimension for the genomic positions, one for the samples), relational tables (being dataframes) as arrays, and even graphs as arrays (when modeled as adjacency matrices). HDF5 and Zarr are popular dense array formats, but natively support neither sparse arrays nor dataframes. Parquet is a useful dataframe format (updates aside), but it cannot naturally model dense arrays. As a result, most Data Science applications require at least two separate file formats to handle both array data and dataframes. Could a single file format handle both?
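To make the dataframe-to-sparse-array equivalence concrete, here is a minimal, library-free sketch (the coordinate scheme and variable names are our own illustration, not TileDB's format): each dataframe cell is addressed by a (row index, column name) coordinate pair, and absent cells are simply not stored.

```python
# A dataframe viewed as a sparse 2D array: coordinates -> value.
# All names below are illustrative; this is not TileDB's API.

rows = [
    {"symbol": "TDB", "price": 10.5},
    {"symbol": "ETC", "volume": 300},   # 'price' missing: no cell stored
]

# Sparse-array representation: only non-empty cells exist.
cells = {
    (i, col): val
    for i, row in enumerate(rows)
    for col, val in row.items()
}

# Slicing on a dimension (here: all cells of row 1) is a coordinate lookup.
row1 = {col: val for (i, col), val in cells.items() if i == 1}
print(row1)  # {'symbol': 'ETC', 'volume': 300}
```

The same idea generalizes: any subset of dataframe columns can serve as the dimensions, with the remaining columns stored as cell attributes.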
We took a bottom-up approach to Data Science with TileDB, starting from the storage layer. TileDB introduces the only format and storage engine (open-source under the MIT License) that handles both dense and sparse multi-dimensional arrays. It supports efficient array IO on multiple storage backends, including cloud object stores like AWS S3. One important feature is its support for rapid, highly parallel, lock-free batch updates, architected to work particularly well on the cloud with immutable objects. All update logic and functionality (such as time traveling) is built into the format and storage engine. TileDB accommodates all Data Science applications with a single format and a unified, intuitive API.
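The update model above can be sketched in a few lines: each write produces a new immutable "fragment", and a read at a given timestamp merges all fragments up to that time, with later writes winning. This is a simplified, stdlib-only illustration of the idea, not TileDB's actual fragment format:

```python
import itertools

# Each write creates an immutable fragment: (timestamp, {cell_index: value}).
# Nothing is ever modified in place -- a good fit for object stores like S3.
_clock = itertools.count(1)
fragments = []

def write(cells):
    fragments.append((next(_clock), dict(cells)))

def read(at_time=None):
    """Merge all fragments with timestamp <= at_time; later writes win."""
    view = {}
    for ts, cells in sorted(fragments):
        if at_time is None or ts <= at_time:
            view.update(cells)
    return view

write({0: "a", 1: "b"})      # fragment written at t=1
write({1: "B", 2: "c"})      # fragment at t=2 overwrites cell 1

print(read())           # {0: 'a', 1: 'B', 2: 'c'}
print(read(at_time=1))  # time travel: {0: 'a', 1: 'b'}
```

Because fragments are never rewritten, concurrent writers need no locks, and reading at an earlier timestamp ("time traveling") falls out of the design for free.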
Ecosystem Integration is Vital
The true power of data scientists lies in the numerous computational tools in their arsenal (SQL engines, Spark/Dask for distributed execution, TensorFlow/PyTorch/etc. for machine/deep learning, and many more) and the wide selection of programming languages. The storage engine must be interoperable with practically all of these tools and languages. And by interoperable, we mean efficiently integrated, so that data can be fetched from the storage medium into the tool internals with the minimum possible overhead (e.g., via zero-copying). Unfortunately, most storage engines do not offer efficient language bindings.
Moreover, many popular DBs do not support external (pluggable) storage, relying only on their own, often proprietary, formats. As a result, the user ends up manually converting data from one format to another, which reduces productivity and considerably impacts performance. Most DBs also do not support user-defined functions (UDFs) written in languages like Python and R. The few DBs that do support multiple languages come with constraints, such as not permitting the full generality of user-defined table functions (UDTFs). The Apache Arrow project takes some good steps towards specifying an interchangeable in-memory data format that can be shared across tools, but it has yet to be adopted by the majority of DBs and other Data Science software.
TileDB offers a standalone, embeddable C++ library that ships with efficient APIs in C, C++, Python, R, Java and Go, and enables direct access to TileDB arrays (instead of typically slow ODBC/JDBC access). This library is also integrated with Spark, Dask, PrestoDB, MariaDB, Arrow and popular geospatial libraries like PDAL, GDAL and Rasterio. TileDB goes one step further: while it allows you to compute natively with your favorite tools, it pushes down as much computation as possible to storage, such as filter conditions from the SQL engines, dataframe computations from Dask and Spark, etc. This leads to a performance boost, due to fast processing in C++ and minimal data copying. By storing your data in TileDB, you can take advantage of the entire Data Science ecosystem, departing from the old monolithic, domain-specific solutions.
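The push-down principle can be illustrated abstractly: instead of shipping every cell to the client and filtering there, the filter runs inside the storage layer, and only matching cells cross the boundary. The sketch below is our own illustration of the principle, not TileDB's implementation:

```python
def scan(storage, predicate=None):
    """Storage-side scan: the filter is applied before data leaves storage."""
    for cell in storage:
        if predicate is None or predicate(cell):
            yield cell

# A toy "array" of 1,000 cells held by the storage layer.
storage = [{"id": i, "price": i * 1.5} for i in range(1_000)]

# A SQL-like "WHERE price > 1495" is pushed down: only 3 matching cells are
# returned, rather than copying all 1,000 cells and filtering in the client.
hits = list(scan(storage, lambda c: c["price"] > 1495))
print(len(hits))  # 3
```

The closer the predicate runs to the bytes on disk (or on S3), the less data is copied and transferred, which is where most of the performance benefit comes from.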
Sharing Data is Hard
Access control is important when you wish to share portions of your data within your organization or with other users on the cloud. Serious DB solutions provide advanced access control features. The problems begin when you do not really need a full-fledged DB. In that case, (i) you end up paying for the DB just for access control, since you may not be utilizing all the other DB features (e.g., joins or group-by queries), and/or (ii) slicing data from the DB may be slow or unscalable, as it will typically have to go through ODBC/JDBC connectors, which may be much slower than direct data access with a lightweight storage engine.
The alternative is not to use a DB at all. But in that case, who will enforce access control, especially if you use multiple computational tools on the same data? One approach is to use something like IAM roles on AWS, if your data is on S3. However, those roles have file semantics; you need to know which files and which byte ranges collectively correspond to a specific logical portion of your data. This can get cumbersome or even unattainable, especially in the presence of updates and compressed/encrypted files. You are thus left pushing all the access control logic to higher-level applications, which all end up implementing more or less the same mechanism.
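Array-level access control, by contrast, has natural semantics: a grant is a logical subarray region, and a request is allowed only if the requested slice falls inside a region granted to that user, regardless of which files or byte ranges back it. A stdlib sketch of the idea (all names and the grant structure are illustrative, not an actual API):

```python
# grants: user -> list of allowed 2D regions, each given as
# ((row_lo, row_hi), (col_lo, col_hi)) with inclusive bounds.
grants = {
    "alice": [((0, 99), (0, 9))],   # rows 0-99, columns 0-9
}

def allowed(user, rows, cols):
    """True iff the requested slice lies entirely inside some granted region."""
    return any(
        r_lo <= rows[0] and rows[1] <= r_hi and
        c_lo <= cols[0] and cols[1] <= c_hi
        for (r_lo, r_hi), (c_lo, c_hi) in grants.get(user, [])
    )

print(allowed("alice", (10, 20), (0, 5)))   # True: slice inside the grant
print(allowed("alice", (90, 120), (0, 5)))  # False: rows exceed the grant
print(allowed("bob", (0, 0), (0, 0)))       # False: no grant at all
```

Because the check is phrased in the array's own coordinates, it survives updates, compression, and encryption, all of which scramble the file/byte-range view that IAM-style policies operate on.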
We observe the same issue with logging. Being able to log everything that happens to your data is important for auditability and for business intelligence insights. DBs are excellent at this, but you face the same problems mentioned above. You can alternatively use the logs of your AWS S3 bucket, but those logs are indecipherable if they merely record the byte ranges of an object PUT or GET request.
To address these problems, we developed a new product called TileDB Cloud. With TileDB Cloud, you can properly manage your arrays on the cloud, and easily share them inside your organization or with other users globally, while monitoring all activity. The key is to push access control and logging down to storage, so that all higher-level tools inherit them. It is the array abstraction that makes it truly easy and intuitive to share any kind of data (dataframes, geospatial, genomics, time series, etc.). Slicing arrays works natively and transparently to the user, just as with the open-source storage engine and all its integrations, via fast REST requests and zero-copying wherever possible. TileDB Cloud is scalable and elastic, and comes with a pay-as-you-go pricing model. If you wish to run TileDB Cloud under your full control in your own private cluster, you can enjoy all the features of the cloud service, plus LDAP and SAML support, with TileDB Enterprise.
Deployment is a Hassle
Building and deploying software has always been a pain, and it is now aggravated by the diversity of the tooling data scientists use. Fortunately, containerization and the ease of spinning resources up and down on the cloud have helped mitigate this pain. Things get tricky when (i) configuring and fine-tuning the software (e.g., Spark, Dask, a DB, etc.), (ii) updating package dependencies (e.g., a new Python package update may break the entire build), and (iii) using DBs that work with only one Python version, so that certain packages may not be supported at all. Moreover, moving analysis from prototype to scalable production (from single nodes and small prototype datasets to on-demand, cost-effective cloud resources) requires data transformation and the adoption of a different data access stack via centrally managed resources. Organizations find themselves either hiring a separate engineering/SysAdmin team to manage those tasks, or having the data scientists themselves waste most of their time doing so.
With the TileDB storage engine, you can test your code on your laptop with a small array, then simply change the array name (URI) to point to a multi-TB array on AWS S3, and the same code continues to work at scale. Furthermore, with TileDB Cloud you can perform serverless SQL queries and Python UDFs on TileDB data stored on S3. Everything works elastically and in a pay-as-you-go fashion. There is no need to spin up or tear down machines, or to build complicated packages. Just install the TileDB client, sign up, and go. We are hard at work adding more serverless functionality, such as deploying sophisticated and diverse workflows in multiple programming languages. Stay tuned!
Data Science is fascinating and data scientists are making remarkable discoveries in numerous domains. TileDB introduces a new DB solution that helps them continue to do so while focusing more on the Science and much less on the Engineering.
But does Data Science need a new DB? One thing is certain: Data Science needs a new storage engine. Being able to store all types of data in a native, cloud-optimized array format that all the higher-level tools can use efficiently, with fast update functionality built in, is a considerable leap forward in addressing the data storage pains. If you are not worried about data management and deployment, the open-source TileDB Developer toolkit has you covered, i.e., you do not really need any further DB functionality. If, however, you find it difficult to share data with others on the cloud with meaningful and intuitive semantics, or time-consuming to deploy and auto-scale your tools in a cost-effective manner, then you need a new kind of DB. New in the sense that it should no longer be monolithic and/or associated only with tables/dataframes and SQL, as in the past. This new DB should allow you to manage your data at the storage level, while enabling you to use the familiar Data Science tools of your choice. It should also enable painless deployment and collaboration, especially on the cloud, which traditional DBs were never meant to address. We envision TileDB Cloud as this new DB for data scientists.
We will follow up with more focused, technical blog posts. In the meantime, the best way to get started is to check out our webpage and docs, give TileDB Cloud a go, or just say hello and tell us about you and your work.