Using Vertica at MindGeek

Published in

MindGeek Engineering Blog

3 min read · Apr 10, 2019

Big Data! It’s massive, comes in different formats, and is under constant change. Companies are now directing their resources toward collecting more user-related data off the web, be it through their branded applications or social media platforms.

Traditionally we have used a range of tools: row-store SQL databases such as MySQL and PostgreSQL, NoSQL databases such as Elasticsearch, and batch-processing tools such as Hadoop Hive. One of the main challenges we faced was presenting the final transformed and analyzed data to our customers in real time. Databases such as MySQL and PostgreSQL struggle with hot and highly volatile data because they lack strong memory-focused search engines, which leads to periodic bottlenecks and performance issues. Adding in-memory key-value stores such as Redis or Memcached can certainly help, but they come with the extra headache of adding more layers to the system that require maintenance and upkeep.

Elasticsearch, on the other hand, requires a very high level of denormalization so that each document can be self-contained, which in turn delays the delivery of real-time data to the customer. Hadoop Hive is great, but since its primary strength lies in batch processing, it is not a good candidate for fast-running queries.

Among the latest contenders, we chose Vertica, “an advanced SQL analytics column-store database that maximizes cloud economics for mission-critical big data analytical initiatives”; at least, this much we were promised before using it. Given that our data is highly relational, we were determined to find a mature, schema-agnostic ecosystem that offers us the following:

- Fast, high-performance scans on large data sets with relatively few columns selected

- Summarizing billions of data rows and returning the results in under a second

- Bulk-loading of data with high compression rates

- Reduced cost of ownership

- High concurrency rates
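
To illustrate why a column store suits scans that select only a few columns, here is a minimal Python sketch. It is a toy model and not how Vertica is implemented: the table, the column names (`user_id`, `country`, `views`, `agent`), and the "values touched" counter are all assumptions made up for illustration.

```python
# Toy table; every name and value here is hypothetical.
rows = [
    {"user_id": 1, "country": "CA", "views": 10, "agent": "a"},
    {"user_id": 2, "country": "US", "views": 20, "agent": "b"},
    {"user_id": 3, "country": "CA", "views": 30, "agent": "a"},
    {"user_id": 4, "country": "FR", "views": 40, "agent": "c"},
]

# Column-store layout: the values of each column are kept together.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def sum_views_row_store(table):
    """Sum one column from a row layout; every field of every row is read."""
    total = 0
    touched = 0
    for row in table:
        touched += len(row)  # the whole record is fetched
        total += row["views"]
    return total, touched


def sum_views_column_store(cols):
    """Sum one column from a column layout; only that column is read."""
    views = cols["views"]
    return sum(views), len(views)


row_total, row_touched = sum_views_row_store(rows)
col_total, col_touched = sum_views_column_store(columns)
print(row_total, row_touched)
print(col_total, col_touched)
```

Both functions compute the same sum, but the column layout reads one value per row instead of all four, and that gap widens as tables get wider.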

Having achieved those objectives with Vertica, we found another feature to be unique and valuable to our needs: Projections. They are actual data storage mechanisms that behave similarly to clustered indexes on the data and can be custom-sorted to allow for better compression and faster retrieval of information. Other features, such as the MPP architecture and cascading resource pools, currently help us sustain the system efficiently for multiple teams in the company.
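
The link between sort order and compression can be sketched with run-length encoding, one common column-store encoding. The Python below is only an illustration under stated assumptions: the `country` column and its values are made up, and the helper `run_length_encode` is hypothetical and is not Vertica code or syntax.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Hypothetical column as it might arrive, in no particular order.
country = ["CA", "US", "CA", "FR", "US", "CA", "FR", "US"]

unsorted_runs = run_length_encode(country)
sorted_runs = run_length_encode(sorted(country))

print(len(unsorted_runs))  # one run per value, since no neighbors match
print(sorted_runs)         # equal values are adjacent, so runs are long
```

Sorting the column before encoding shrinks eight runs to three here, which is the intuition behind choosing a projection's sort order with compression in mind.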

We are currently running Vertica version 8.1.1 on 7 nodes, hosted on AWS C5 instances. Our next planned move is to test an upgrade to Vertica Eon Mode, which would allow us to scale up and down rapidly in response to our changing workloads.

The domain of Big Data is forever expanding, and with it the competition among its technology providers. We are continuously looking for ways to improve our existing Big Data ecosystem and drive it forward to better meet the needs of our customers.

Written by MindGeek Engineering