Files, RDBMS, Elasticsearch, Druid, MapD

Cache Storage Options for your Next Custom Visualization

ajo
4 min read · Dec 20, 2017

There are many options for supporting fast, interactive Custom Viz apps. I’m going to cover some that I’ve looked into or used in my projects.

Precomputed Files

If the set of dimensions and measures is small and the number of rows is small, you can reasonably precalculate several mini cubes into files (often in JSON format). For a personal project, I also developed a client-side OLAP library to further filter and pivot the data in the browser. You have to consider bandwidth and file size in this case: larger files take longer to download.
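As a rough illustration of the precomputation step, here is a minimal Python sketch that sums one measure per dimension combination and writes each mini cube to its own small JSON file. The function names, the file naming scheme, and the sample rows are all hypothetical; this is not the client-side library mentioned above, just one way the build step could look.

```python
import json
from collections import defaultdict
from pathlib import Path


def build_mini_cube(rows, dimensions, measure):
    """Sum `measure` for every distinct combination of `dimensions`."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row[measure]
    # Compact array-of-arrays output: [dim1, dim2, ..., total]
    return [list(key) + [total] for key, total in sorted(totals.items())]


def write_mini_cubes(rows, cube_specs, measure, out_dir):
    """Write one small JSON file per dimension combination (hypothetical layout)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for dimensions in cube_specs:
        cube = {
            "columns": list(dimensions) + [measure],
            "rows": build_mini_cube(rows, dimensions, measure),
        }
        path = out_dir / ("cube_" + "_".join(dimensions) + ".json")
        path.write_text(json.dumps(cube))
        paths.append(path)
    return paths


# Hypothetical sample data, for illustration only.
SAMPLE_ROWS = [
    {"region": "west", "product": "a", "sales": 10.0},
    {"region": "west", "product": "b", "sales": 5.0},
    {"region": "east", "product": "a", "sales": 7.0},
    {"region": "west", "product": "a", "sales": 3.0},
]
```

Calling `write_mini_cubes(SAMPLE_ROWS, [["region"], ["region", "product"]], "sales", "cubes")` would produce two files, one per cube. Keeping each file to a single dimension combination is what keeps downloads small, at the cost of having more files to manage.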

Pros

  • No external setup dependencies
  • Super snappy for small data scenarios
  • Quick to get up and running
  • Free. No additional software expenses

Cons

  • File size has to be limited for fast downloads, which leads to the generation of numerous files. This can become difficult to manage.
  • Harder to debug data issues within files without moving them to some queryable interface
  • OLAP functions such as filtering, grouping, windowing, etc. have to be hand-coded

I plan to refactor my client-side OLAP lib and open source it someday soon. Stay tuned for that by following me on Medium or Twitter.

Relational DB (Postgres)

When we talk about dashboards or Custom Visualizations, we are talking about a subset of your data. This typically ranges from 2 GB to 10 GB and sometimes more. Relational DBs can provide sub-second response times if architected for the dashboard in question. One gotcha is table width: the wider your table, the more IO is needed to process your data. An RDBMS like Postgres stores data in a row-oriented fashion and, as such, will retrieve the entire row from disk to complete the request.
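To make the table-width point concrete, here is a back-of-the-envelope sketch in Python. It assumes a deliberately simplified model in which a full scan reads every column of every row from a row store, but only the requested columns from a column store. It ignores pages, indexes, compression, and caching, and the column names and widths are made up.

```python
def row_store_bytes(num_rows, column_widths):
    """Bytes read when each scanned row is fetched whole (simplified model)."""
    return num_rows * sum(column_widths.values())


def column_store_bytes(num_rows, column_widths, needed_columns):
    """Bytes read when only the requested columns are scanned (simplified model)."""
    return num_rows * sum(column_widths[c] for c in needed_columns)


# Hypothetical wide table: 50 columns of 8 bytes each, 10 million rows.
WIDTHS = {f"col_{i}": 8 for i in range(50)}
NUM_ROWS = 10_000_000

full_rows = row_store_bytes(NUM_ROWS, WIDTHS)
three_cols = column_store_bytes(NUM_ROWS, WIDTHS, ["col_0", "col_1", "col_2"])
```

Under these assumptions, a query that touches 3 of the 50 columns reads 4 GB from the row layout versus 240 MB from the columnar one. Real systems will differ, but this is the reason a narrow, dashboard-specific table tends to pay off in a row-oriented database.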

Pros

  • Well known technology and ecosystem (drivers, connectors, and automation tooling)
  • SQL
  • Built-in analytical functions (windowing, stats, etc.)
  • Can join across many tables
  • Performance can be sub-second for well-architected systems
  • Battle tested for decades

Cons

  • You must spend time planning, tuning, and optimizing your DB for your Viz use case. It can be done, but it takes some effort.
  • RDBMSs are built for transactional use cases and typically use row-oriented storage. As a result, analytical workloads are rarely sub-second for bigger data volumes.
  • Poorly planned and architected systems will be very slow

Elasticsearch

If you can afford it, Elasticsearch is one of the fastest options out there. Some of our indexes get sub-100-millisecond response times. While it will perform well in many scenarios, it has a steep learning curve.

Pros

  • Superfast!
  • HTTP endpoints make it easy to query from any programming language
  • Great documentation and well known in the big data community
  • Schema-less design makes it easy to manage and update

Cons

  • Analytical workloads often involve aggregations, and Elasticsearch aggregation syntax is convoluted
  • Aggregated results are not a simple array of arrays. They are deeply nested JSON objects whose depth depends on the number of fields in the “group by”
  • Limited aggregation functions
  • Designed to be deployed on a distributed cluster, which adds to infrastructure maintenance and management
  • Expensive
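
The nesting issue from the cons above can be worked around with a small flattening helper. The Python sketch below assumes a response from nested `terms` aggregations with a single metric sub-aggregation at the innermost level. The aggregation names (`region`, `product`, `total_sales`) and the sample response are hypothetical; in a real query they are whatever names you chose in the request.

```python
def flatten_buckets(agg, group_fields, metric, prefix=()):
    """Turn nested terms buckets into rows of [key1, key2, ..., metric_value]."""
    field, rest = group_fields[0], group_fields[1:]
    rows = []
    for bucket in agg[field]["buckets"]:
        keys = prefix + (bucket["key"],)
        if rest:
            # One more "group by" level is nested inside this bucket.
            rows.extend(flatten_buckets(bucket, rest, metric, keys))
        else:
            # Innermost level: read the metric sub-aggregation's value.
            rows.append(list(keys) + [bucket[metric]["value"]])
    return rows


# Hypothetical response for a two-field "group by" with one sum metric.
SAMPLE_RESPONSE = {
    "aggregations": {
        "region": {
            "buckets": [
                {"key": "east", "doc_count": 1, "product": {"buckets": [
                    {"key": "a", "doc_count": 1, "total_sales": {"value": 7.0}},
                ]}},
                {"key": "west", "doc_count": 3, "product": {"buckets": [
                    {"key": "a", "doc_count": 2, "total_sales": {"value": 13.0}},
                    {"key": "b", "doc_count": 1, "total_sales": {"value": 5.0}},
                ]}},
            ]
        }
    }
}

FLAT_ROWS = flatten_buckets(
    SAMPLE_RESPONSE["aggregations"], ["region", "product"], "total_sales"
)
```

Here `FLAT_ROWS` is a plain array of arrays, one row per innermost bucket, which is usually the shape a charting layer wants.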

Druid

Druid is rapidly emerging as the go-to OLAP solution for major big data operations. It is built for distributed, in-memory storage of data and is optimized for OLAP workloads. This means fast ingestion and aggregations. Druid claims petabyte-scale performance and is deployed at companies like Netflix, Alibaba, PayPal, and more.

Pros

  • Designed for OLAP workloads
  • Distributed in-memory architecture makes it possible to scale to petabytes of data
  • Well adopted by major companies including Netflix etc.
  • Extremely fast queries
  • Runs on commodity hardware
  • Can use JavaScript to transform and filter data

Cons

  • Complex query syntax
  • Very minimal join capabilities
  • Only basic aggregation functions (sum, avg, count, max, min)
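
To give a feel for the query syntax, here is a minimal Python sketch that builds the JSON body of a Druid native `groupBy` query. The helper name, data source, field names, and interval are hypothetical. The sketch only constructs the request body and does not send it anywhere.

```python
import json


def build_group_by_query(data_source, dimensions, sum_fields, interval):
    """Build a Druid native groupBy query body as a plain dict."""
    aggregations = [
        {"type": "doubleSum", "name": f"sum_{field}", "fieldName": field}
        for field in sum_fields
    ]
    aggregations.append({"type": "count", "name": "row_count"})
    return {
        "queryType": "groupBy",
        "dataSource": data_source,
        "granularity": "all",
        "dimensions": list(dimensions),
        "aggregations": aggregations,
        "intervals": [interval],
    }


# Hypothetical data source and fields, for illustration only.
QUERY = build_group_by_query(
    "sales_events", ["region", "product"], ["sales"], "2017-01-01/2018-01-01"
)
QUERY_JSON = json.dumps(QUERY, indent=2)
```

Even this simple "sum of sales by region and product" takes a dozen lines of JSON, which is the kind of verbosity the first con refers to.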

MapD

MapD is an in-memory columnar DB that can utilize GPUs for even faster query processing. It even supports full-on SQL! I found it extremely promising as a storage layer that can ride with your Custom Visualization. You can use the community edition or upgrade to an enterprise option.

Pros

  • Full SQL support including windowing functions!
  • Very fast for certain data sizes, making it optimal for Custom Viz workloads
  • Super easy to set up and load data from your existing data warehouse (SQL Importer tool is great)

Cons

  • New kid on the block with not many notable customers yet
  • Connector ecosystem needs improvement
  • A few rough edges that will surely improve with time
  • Poor documentation regarding the HTTP API


Ajo Abraham is a big data expert known for building beautiful, fast, and high-impact custom viz apps. For consulting requests, you can email him at ajo@veroanalytics.com.
