There are so many options for supporting fast interactive Custom Viz apps. I’m going to cover some that I’ve looked into or actually used in my projects.
Precomputed Files
If the set of dimensions and measures is small and the number of rows is small, you can reasonably pre-calculate several mini cubes into files (often in JSON format). For a personal project, I also developed a client-side OLAP library to further filter and pivot the data in the browser. You have to consider bandwidth and file size in this case: larger files will take longer to download.
Pros
- No external setup dependencies
- Super snappy for small data scenarios
- Quick to get up and running
- Free. No additional software expenses
Cons
- File size has to be limited for fast downloads, which leads to generating numerous files. This can become difficult to manage.
- Harder to debug data issues within files without moving them to some queryable interface
- OLAP functions such as filtering, grouping, windowing, etc. have to be hand-coded
I plan to refactor my client-side OLAP library and open source it someday soon. Stay tuned for that by following me on Medium or Twitter.
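The client-side filter-and-pivot idea described above can be sketched in a few lines. This is a minimal, hypothetical example in Python for brevity (a browser implementation would be JavaScript), and is not the library mentioned above; the `pivot` function and the sample cube are illustrative.

```python
from collections import defaultdict

def pivot(rows, filters, group_by, measure, agg=sum):
    """Filter rows, group by one dimension, and aggregate a measure."""
    # Keep only rows matching every filter key/value pair.
    kept = [r for r in rows if all(r[k] == v for k, v in filters.items())]
    groups = defaultdict(list)
    for r in kept:
        groups[r[group_by]].append(r[measure])
    return {key: agg(vals) for key, vals in groups.items()}

# A "mini cube" as it might arrive in a precomputed JSON file.
cube = [
    {"region": "East", "year": 2017, "sales": 120},
    {"region": "West", "year": 2017, "sales": 90},
    {"region": "East", "year": 2018, "sales": 150},
]

print(pivot(cube, {"year": 2017}, "region", "sales"))
# {'East': 120, 'West': 90}
```

Because the cubes are pre-aggregated and small, this kind of in-memory pass is effectively instant, which is what makes the precomputed-files approach feel so snappy.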
Relational DB (Postgres)
When we are talking about dashboards or Custom Visualizations, we are talking about a subset of your data. This typically ranges from 2GB to 10GB, and sometimes more. Relational DBs can provide sub-second response times if architected for the dashboard in question. One of the gotchas is going to be table width: the wider your table, the more IO will be needed to process your data. An RDBMS like Postgres stores data in a row-oriented fashion, and as such will retrieve the entire row from disk to complete a request.
Pros
- Well known technology and ecosystem (drivers, connectors, and automation tooling)
- SQL
- Builtin analytical functions (windowing, stats etc)
- Can join across many tables
- Performance can be sub second for well architected systems
- Battle tested for decades
Cons
- You must spend time planning, tuning, and optimizing your DB for your Viz use case. It can be done, but it takes some effort.
- RDBMSs are built for transactional use cases and typically use row-oriented storage. As a result, analytical workloads are rarely sub-second at larger data volumes.
- Poorly planned and architected systems will be very slow
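One common way to do the planning the section above calls for is to pre-aggregate the wide fact table into a narrow rollup table sized for the dashboard, so interactive queries scan fewer and narrower rows. The sketch below uses an in-memory SQLite database purely so the example is self-contained; the SQL is standard and the same idea applies to Postgres. Table and column names are made up.

```python
import sqlite3

# In-memory SQLite stands in for Postgres here; the SQL is standard.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, year INT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("East", 2017, 120), ("West", 2017, 90), ("East", 2018, 150)],
)

# Pre-aggregate into a narrow rollup table so dashboard queries
# avoid touching the full width of the fact table.
conn.execute(
    "CREATE TABLE sales_rollup AS "
    "SELECT region, year, SUM(amount) AS total "
    "FROM sales GROUP BY region, year"
)

rows = conn.execute(
    "SELECT region, total FROM sales_rollup WHERE year = 2017 ORDER BY region"
).fetchall()
print(rows)  # [('East', 120.0), ('West', 90.0)]
```

In Postgres you would typically refresh such a rollup with a materialized view or a scheduled job, and index it on the columns your dashboard filters by.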
Elasticsearch
If you can afford it, Elasticsearch is one of the fastest options out there. Some of our indexes get sub-100-millisecond response times. While it will perform well in many scenarios, it has a steep learning curve.
Pros
- Superfast!
- HTTP endpoints make it easy to query from any programming language
- Great documentation and well known in the big data community
- Schema-less design makes it easy to manage and update
Cons
- Analytical workloads often involve aggregations. Elasticsearch aggregation syntax is convoluted
- Aggregated results are not a simple array of arrays. They are deeply nested JSON objects whose depth depends on the number of fields in the “group by”
- Limited aggregation functions
- Designed to be deployed on a distributed cluster, adding to infrastructure maintenance and management
- Expensive
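To make the nested-results con above concrete, here is a sketch of flattening a response shaped like Elasticsearch's nested terms aggregations (two “group by” fields produce two levels of buckets). The response data and aggregation names (`by_region`, `by_year`, `total`) are hypothetical, but the bucket/`key`/`value` shape matches what terms aggregations return.

```python
# A response shaped like a two-level terms aggregation. Hypothetical data.
resp = {
    "aggregations": {
        "by_region": {
            "buckets": [
                {"key": "East",
                 "by_year": {"buckets": [
                     {"key": 2017, "total": {"value": 120}},
                     {"key": 2018, "total": {"value": 150}},
                 ]}},
                {"key": "West",
                 "by_year": {"buckets": [
                     {"key": 2017, "total": {"value": 90}},
                 ]}},
            ]
        }
    }
}

def flatten(buckets, levels, metric, prefix=()):
    """Walk nested buckets and emit flat (key..., value) rows."""
    rows = []
    for b in buckets:
        path = prefix + (b["key"],)
        if levels:  # descend into the next sub-aggregation
            rows += flatten(b[levels[0]]["buckets"], levels[1:], metric, path)
        else:
            rows.append(path + (b[metric]["value"],))
    return rows

rows = flatten(resp["aggregations"]["by_region"]["buckets"], ["by_year"], "total")
print(rows)
# [('East', 2017, 120), ('East', 2018, 150), ('West', 2017, 90)]
```

Every Custom Viz that charts Elasticsearch aggregations ends up writing some variant of this unnesting step, which is exactly the friction the con describes.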
Druid
Druid is rapidly emerging as the go-to OLAP solution for major big data operations. It is built for distributed, in-memory storage of data and optimized for OLAP workloads. This means fast ingestion and aggregations. Druid claims petabyte-scale performance and is deployed at companies like Netflix, Alibaba, PayPal, and more.
Pros
- Designed for OLAP workloads
- Distributed in-memory architecture makes it possible to scale to petabytes of data
- Well adopted by major companies including Netflix etc.
- Extremely fast queries
- Runs on commodity hardware
- Can use JavaScript to transform and filter data
Cons
- Complex query syntax
- Very minimal join capabilities
- Only basic aggregation functions (sum, avg, count, max, min)
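To illustrate the “complex query syntax” con: Druid's native queries are JSON documents rather than SQL. The sketch below shows roughly what `SELECT region, SUM(sales) ... GROUP BY region` looks like as a native groupBy query; the data source, interval, and field names are made up for the example.

```python
import json

# A Druid native groupBy query: the JSON equivalent of
# SELECT region, SUM(sales) FROM sales_ds GROUP BY region.
# Data source, interval, and field names are hypothetical.
query = {
    "queryType": "groupBy",
    "dataSource": "sales_ds",
    "granularity": "all",
    "intervals": ["2017-01-01/2018-01-01"],
    "dimensions": ["region"],
    "aggregations": [
        {"type": "doubleSum", "name": "total_sales", "fieldName": "sales"}
    ],
}

# The JSON body would be POSTed to a Druid broker endpoint.
print(json.dumps(query, indent=2))
```

Compared with one line of SQL, assembling and maintaining these JSON documents for every chart is the tax Druid asks in exchange for its speed.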
MapD
MapD is an in-memory columnar DB that can utilize GPUs for even faster query processing. It even supports full-on SQL! I found it extremely promising as a storage layer that can ride along with your Custom Visualization. You can use the community edition or upgrade to an enterprise option.
Pros
- Full SQL support including windowing functions!
- Very fast for certain data sizes making it optimal for Custom Viz workloads
- Super easy to setup and load data from your existing data warehouse (SQL Importer tool is great)
Cons
- New kid on the block with not many notable customers yet
- Connector ecosystem needs improvement
- A few rough edges that will surely improve with time
- Poor documentation regarding HTTP api
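The windowing support called out in the Pros is plain standard SQL. As a self-contained stand-in, the sketch below runs an equivalent window-function query against in-memory SQLite (3.25+); the table, data, and running-total query are made up for illustration.

```python
import sqlite3

# In-memory SQLite stands in for MapD; the window-function SQL is standard.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, year INT, amount INT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("East", 2017, 120), ("East", 2018, 150), ("West", 2017, 90)],
)

# Running total of sales per region, ordered by year.
rows = conn.execute(
    "SELECT region, year, "
    "SUM(amount) OVER (PARTITION BY region ORDER BY year) AS running "
    "FROM sales ORDER BY region, year"
).fetchall()
print(rows)
# [('East', 2017, 120), ('East', 2018, 270), ('West', 2017, 90)]
```

Running totals, ranks, and moving averages like this are exactly the kinds of calculations a Custom Viz needs, and having them in the database beats hand-coding them client-side.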
Read these for more on the Custom Visualization wave sweeping Silicon Valley and beyond:
Ajo Abraham is a big data expert known for building beautiful, fast, and high impact custom viz apps. For consulting requests you can email him here ajo@veroanalytics.com