How we process and store terabytes of data

Jan Antala
Pygmalios Engineering
5 min read · Oct 16, 2019

The Pygmalios analytics platform collects data from various sensors installed in stores and receives thousands of events every minute. And it is still growing, so we have to scale. That's a lot of messages, and every message has to be stored so we can provide analytics on both current trends and historical data. Every stored message takes disk space, so we need to be efficient. I don't mean only read/write performance but also cost effectiveness.

Data pipeline

We use the lambda architecture for data processing, so our system consists of three layers: batch processing, speed (or real-time) processing, and a serving layer that responds to queries.

Messages are stored as raw data, then processed into sessions stored as processed data, and finally aggregated and stored as aggregated data for the serving layer.

We define three data layers in our system (a small sketch follows the list):

  • raw data: data from sensors
    Imagine a person's x, y coordinates for every second in the store.
  • processed data: grouped data in session format
    Imagine a shopping session with start and end timestamps and a trajectory.
  • aggregated data: metrics in hour or day intervals
    Imagine shopping zones inside the store with dwell time and visit count metrics.
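To make these layers concrete, here is a minimal sketch of how they could be modelled; the class and field names are illustrative assumptions, not our actual schema.

```scala
import java.time.Instant

// Raw data: one position sample per person per second (illustrative fields).
case class RawPosition(venueId: String, sourceId: String, personId: String,
                       ts: Instant, x: Double, y: Double)

// Processed data: a shopping session grouped from raw positions.
case class Session(venueId: String, personId: String,
                   startTs: Instant, endTs: Instant,
                   trajectory: Seq[(Double, Double)])

// Aggregated data: per-zone metrics for an hour or day interval.
case class ZoneMetrics(venueId: String, zoneId: String,
                       intervalStart: Instant, intervalLength: String,
                       visitCount: Long, avgDwellSeconds: Double)
```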

Kafka

Every message collected from our sensors or other cloud providers is stored in a Kafka cluster first. It is the gateway to our system for real-time data processing: fast, fault-tolerant and horizontally scalable.

Data collection is separated into different Kafka topics, as we use various data sources. All messages are collected in real time as they come from the sensors and include source ids and the data collection timestamp in the payload.
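A minimal producer sketch of what such a message could look like; the broker addresses, topic name and payload layout are illustrative assumptions, not our actual configuration.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Illustrative producer configuration (placeholder brokers).
val props = new Properties()
props.put("bootstrap.servers", "kafka-1:9092,kafka-2:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)

// Every payload carries the source id and the data collection timestamp.
val payload = """{"sourceId":"sensor-42","collectedAt":"2019-10-16T10:15:00Z","x":12.4,"y":3.7}"""
producer.send(new ProducerRecord("positions-raw", "sensor-42", payload))
producer.close()
```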

Kafka topic message rate

Spark streaming jobs then store them in Cassandra tables as the original raw data, where they are sorted, deduplicated and partitioned by client. We use a 30-day retention period for Kafka topic messages, which means every message older than 30 days is removed from the log. That gives us enough time to store all data in long-term storage even during high load. Everything is monitored and under control.
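A simplified sketch of such an ingestion job, assuming Spark Structured Streaming and the Spark Cassandra connector; the topic, keyspace and table names, the payload schema and the checkpoint path are placeholders, not our production setup.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("raw-data-ingest").getOrCreate()
import spark.implicits._

// Illustrative payload schema: source id, collection timestamp and coordinates.
val schema = new StructType()
  .add("sourceId", StringType)
  .add("collectedAt", TimestampType)
  .add("x", DoubleType)
  .add("y", DoubleType)

// Read the raw topic and parse the JSON payload.
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "kafka-1:9092")
  .option("subscribe", "positions-raw")
  .load()
  .select(from_json($"value".cast("string"), schema).as("msg"))
  .select("msg.*")

// Write each micro-batch to the raw-data table via the Spark Cassandra connector,
// deduplicating within the batch before the append.
def writeRawBatch(batch: DataFrame, batchId: Long): Unit =
  batch.dropDuplicates("sourceId", "collectedAt")
    .write
    .format("org.apache.spark.sql.cassandra")
    .options(Map("keyspace" -> "pygmalios", "table" -> "raw_positions"))
    .mode("append")
    .save()

stream.writeStream
  .option("checkpointLocation", "/tmp/raw-ingest-checkpoint")
  .foreachBatch(writeRawBatch _)
  .start()
  .awaitTermination()
```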

Storing lag rates

Spark and Cassandra

As Kafka is the gateway to the Pygmalios cloud, Spark and Cassandra are its heart. We use a Cassandra cluster, and all tables are replicated across nodes and kept in sync by the NodeSync service. It is fully automatic, no manual intervention is needed, and it also provides a repair service.
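In DSE, NodeSync is enabled per table; a small sketch of how that could be done with the DataStax Java driver 4.x, assuming placeholder keyspace and table names:

```scala
import com.datastax.oss.driver.api.core.CqlSession

// Enable NodeSync on a table (placeholder names); the cluster contact points
// are taken from the driver configuration.
val session = CqlSession.builder().build()
session.execute(
  "ALTER TABLE pygmalios.raw_positions WITH nodesync = {'enabled': 'true'}")
session.close()
```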

Storing Raw Data

Kafka topic messages are read in real time and saved into raw data tables in our Cassandra database by Spark streaming jobs. Cassandra has great write performance and is horizontally scalable, so we are ready to increase throughput by adding more nodes. We store the whole history, every message received, on SSD disks in Google Cloud Platform. The data is automatically partitioned by client venue and day, sorted by timestamp and deduplicated by Cassandra clustering keys.
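A sketch of what such a raw-data table layout could look like, expressed as CQL executed through the DataStax Java driver; the keyspace, table and column names are illustrative. The composite partition key groups one day of data per venue, and the clustering columns keep rows sorted by timestamp while identical samples are deduplicated on upsert.

```scala
import com.datastax.oss.driver.api.core.CqlSession

val session = CqlSession.builder().build()

// Illustrative raw-data layout: partitioned by (venue, day), clustered by
// timestamp so reads come back sorted and repeated writes overwrite duplicates.
session.execute("""
  CREATE TABLE IF NOT EXISTS pygmalios.raw_positions (
    venue_id  text,
    day       date,
    ts        timestamp,
    source_id text,
    person_id text,
    x         double,
    y         double,
    PRIMARY KEY ((venue_id, day), ts, source_id, person_id)
  ) WITH CLUSTERING ORDER BY (ts ASC, source_id ASC, person_id ASC)""")
session.close()
```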

Raw Data Backups: Google Cloud Storage

Every day, Spark night batch jobs transfer raw data from Cassandra tables to Google Cloud Storage buckets. We read whole Cassandra partitions from the last days as data frames and transform them into the Parquet file format, which has direct Spark support and provides efficient data compression, so it is great for storing terabytes of data. These jobs are monitored, so we can see whether everything is OK.
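A simplified sketch of such a nightly backup job; the bucket, keyspace and table names are placeholders, and the GCS connector is assumed to be configured on the cluster.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import java.time.LocalDate

// Read yesterday's raw-data partitions from Cassandra and write them to a
// GCS bucket as Parquet (placeholder names throughout).
val spark = SparkSession.builder.appName("raw-data-backup").getOrCreate()

val day = LocalDate.now().minusDays(1).toString

val raw = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "pygmalios", "table" -> "raw_positions"))
  .load()
  .where(col("day") === day)   // restricts the read to yesterday's partitions

raw.write
  .mode("overwrite")
  .parquet(s"gs://pygmalios-raw-backups/day=$day")
```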

Backup jobs monitoring

You can also use different storage classes for the Google Cloud Storage buckets for better cost effectiveness, depending on your needs.

The key benefit of cloud storage, however, is the restore service. We can load backed-up data directly from a cloud storage bucket and transfer it back to Cassandra, or recalculate metrics from it directly.
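A sketch of the restore path under the same assumptions: load the Parquet backup straight from the bucket and append it back to the raw-data table, or feed it into a recalculation job instead.

```scala
import org.apache.spark.sql.SparkSession

// Restore one backed-up day from GCS into Cassandra (placeholder names).
val spark = SparkSession.builder.appName("raw-data-restore").getOrCreate()

val backup = spark.read.parquet("gs://pygmalios-raw-backups/day=2019-10-15")

backup.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "pygmalios", "table" -> "raw_positions"))
  .mode("append")
  .save()
```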

The last benefit of this backup service is statistics. We can track the daily usage of all Pygmalios analytics modules by our customers and see how many events occurred and how the modules perform.

Processing Data

Data is processed both in real time and in daily batch jobs, grouped into sessions and stored in the processed data Cassandra tables. These are then used for the final aggregations, which are stored in the aggregated data tables. Aggregated data is relatively small, but that's the goal. All jobs are monitored, so we can see that everything runs smoothly.
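A simplified sketch of such an aggregation, rolling processed sessions up into hourly visit counts and average dwell times; the table and column names are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("hourly-aggregation").getOrCreate()

// Read processed sessions from Cassandra (placeholder names).
val sessions = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "pygmalios", "table" -> "sessions"))
  .load()

// Roll sessions up into one row per venue per hour.
val hourly = sessions
  .withColumn("hour", date_trunc("hour", col("start_ts")))
  .withColumn("dwell_seconds",
    unix_timestamp(col("end_ts")) - unix_timestamp(col("start_ts")))
  .groupBy("venue_id", "hour")
  .agg(count("*").as("visit_count"), avg("dwell_seconds").as("avg_dwell_seconds"))

hourly.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "pygmalios", "table" -> "hourly_metrics"))
  .mode("append")
  .save()
```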

We also use a separate REST API for batch jobs, so we can re-run them whenever we need to.

Clearing old data

We can’t store the whole history on Cassandra SSDs; it is expensive and nobody would pay for it, so we have to be cost effective. Since all raw data is backed up in the cheaper cloud storage, we have separate cleaning Spark batch jobs that run automatically on a regular basis. They delete all raw and processed data older than 3 months from Cassandra, so only aggregated data stays in the Cassandra database. We also have batch jobs ready to read from GCS, so recalculating historical data is no problem; it just has to be part of a premium service, as it is more expensive.
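A sketch of how such a cleanup job could work: list the raw-data partitions older than the cutoff and delete them partition by partition. The names are placeholders, and a real job would avoid the full-table scan used here for brevity.

```scala
import com.datastax.oss.driver.api.core.CqlSession
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import java.time.LocalDate

val spark = SparkSession.builder.appName("raw-data-cleanup").getOrCreate()
val cutoff = LocalDate.now().minusMonths(3).toString

// Find the (venue, day) partitions older than the cutoff.
val oldPartitions = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "pygmalios", "table" -> "raw_positions"))
  .load()
  .select("venue_id", "day")
  .where(col("day") < cutoff)
  .distinct()
  .collect()

// Delete each old partition via the DataStax Java driver.
val session = CqlSession.builder().build()
val delete = session.prepare(
  "DELETE FROM pygmalios.raw_positions WHERE venue_id = ? AND day = ?")

oldPartitions.foreach { row =>
  val venue = row.getAs[String]("venue_id")
  val day   = row.getAs[java.sql.Date]("day").toLocalDate
  // Deleting by the full partition key removes a whole day of raw data at once.
  session.execute(delete.bind(venue, day))
}
session.close()
```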

Aggregated Data Backups: DataStax OpsCenter

We use the DataStax backup and restore service for aggregated data only. We have to be able to restore this data very quickly for our customers in case of emergency. I hope this will never happen, but we have to be ready and perform regular tests. Aggregated data is relatively small, so this is easy to achieve.

DSE OpsCenter Backup service

Conclusion

Scaling is one of the most challenging phases of startup life. It is a hard and very long-term process, and data processing and storage are a key part of it. Pygmalios can process and store terabytes of data across various services and is ready for further scaling.
