What Analytics DB to use when you need to crunch several hundred million row dataset in real-time

Robert Fehrmann
Snagajob Engineering
2 min readJun 19, 2014

--

A couple of month ago we were looking for new ways to allow end users (a couple of thousand external users) to crunch through their detailed data in real time as well as enabling internal users and data analysts to gain the information they needed to run and optimize their business processes.

Unfortunately our current system had become slower and slower over time due to the tremendous increase in data to be managed so a new approach had to be taken to accomplish this goal. Our existing data warehouse/data management infrastructure just could not handle big data.

We evaluated a variety of different solutions such as Amazon Redshift, Infobright and Microsoft. Vertica won out above all these other solutions. Our dataset is several hundred million rows and our avg. response time goal was less than 5 secs. We are building our environment for the future so another requirement was to be able to scale horizontally.

Redshift came close in response time but failed in concurrency, meaning multiple users running an analysis at the same time. Infobright came close in response time and concurrency but didn’t provide sufficient scalability. Vertica checked all boxes at a very competitive price-point.

We found that the extreme speed, performance and flexibility is superior to all the other solutions out there. The massive scalability on industry-standard hardware, standard SQL interface and database designer and administration tools are excellent features of Vertica. I also really value the simplicity, concurrency for hundreds or thousands of users, and aggressive compression.

This new environment allowed us to implement applications such as clickstream and predictive analysis which have added tremendous value for us. Currently there is about 500 GB — 1 TB of data that I am managing and I have found that Vertica is able to be integrated very well with a variety of Business Intelligence (BI), visualization, and ETL tools in their environment. I use Hadoop, Tableau and Birst and using all these solutions with Vertica has been overall quite smooth.

Our query performance has increased by 500–1,000% through improvements in response time and I am now able to compress our data by more than 50%. The simultaneous loading and querying and aggressive compression has helped us become more efficient and productive. Furthermore the high availability without hardware redundancy, optimizer and execution engine, and high availability for analytics systems has saved us both time and money.

Originally published at engineering.snagajob.com on June 19, 2014.

--

--