Serverless and Realtime Data Analytics for a Retailer on GCP

Shruti Naik
Jun 5, 2018 · 3 min read

We recently worked with a top retail company in India that was looking for ways to perform realtime data analytics to drive market growth. They had multiple on-premises point-of-sale (POS) systems and a data warehouse into which this data landed once a day, and the customer was experiencing problems with scale, availability, licensing, and more.

GCP's suite of data services is a perfect fit for this problem and improves efficiency. This post explains how we helped the customer move from scale issues to serverless, and from once-a-day refreshed dashboards to realtime analytics. Read on.

The customer’s enterprise applications were generating an unprecedented amount of data from diverse POS systems. The major concerns were:

  1. The previous infrastructure was not able to handle large data sets
  2. Uptime

To clear this hurdle while also providing scalable, durable storage and zero-downtime maintenance, we chose the following GCP stack:

Pub/Sub → Dataflow → (GCS) → BigQuery → Data Studio

Why these components?

  1. Pub/Sub : Provides a simple, reliable and scalable foundation for streaming analytics data.
  2. Dataflow : A fully managed service for stream and batch data processing, built on Apache Beam. Its serverless approach removes operational overhead: performance, scaling, availability, security and compliance are handled automatically, so users can focus on programming instead of managing server clusters.
  3. Google Cloud Storage : Provides high availability, consistency, durability and performance for analytics and ML workloads, with unlimited object storage.
  4. BigQuery : A seamlessly scalable, petabyte-scale data warehouse. Because it is a columnar datastore, we don’t need to worry about building an elaborate data model up front.
  5. Data Studio : A visualisation tool tightly integrated with GCP.

How the stack looked on-premises

Overview of how our solution works on GCP:

  1. Push XML files from the customer’s central data centre to Pub/Sub.
  2. A Dataflow job subscribes to this stream, extracts meaningful information from the XML data, and inserts it into BigQuery.
  3. In parallel, add the raw XML data to GCS.
  4. Configure a dashboard that gives realtime insights through its integration with BigQuery.
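The steps above can be sketched as a single streaming Beam pipeline. This is a minimal sketch under stated assumptions: the topic path, the BigQuery table name, and the POSLog element names (StoreID, Total) are hypothetical placeholders, and the table is assumed to already exist with a matching schema.

```python
import xml.etree.ElementTree as ET

def parse_poslog(xml_bytes):
    """Pull a few fields out of one POSLog XML message.

    The element names here are illustrative; real POSLog schemas are richer.
    """
    root = ET.fromstring(xml_bytes)
    return {'store_id': root.findtext('StoreID'),
            'total': root.findtext('Total')}

def run(argv=None):
    # Beam imports live inside run() so parse_poslog stays importable
    # even where the SDK is not installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(argv, streaming=True)
    with beam.Pipeline(options=options) as p:
        (p
         | 'ReadFromPubSub' >> beam.io.ReadFromPubSub(
               topic='projects/my-project/topics/poslog')  # hypothetical topic
         | 'ParseXML' >> beam.Map(parse_poslog)
         | 'LoadToBigQuery' >> beam.io.WriteToBigQuery(
               'my-project:retail.sales',  # hypothetical, pre-created table
               create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))
```

The GCS archival branch (step 3) is omitted from the sketch for brevity; it would be a second branch off the raw Pub/Sub read.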

The Serverless Architecture We Proposed

  1. Pushing XML File into Pub/Sub :

From the enterprise’s central data server, we send the POSLog file to Google Cloud Pub/Sub, then move the file into an archive so that no duplicate data reaches the warehouse.
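A small sketch of this publish step, assuming the google-cloud-pubsub client library; the project and topic names are hypothetical placeholders.

```python
def to_pubsub_payload(xml_string):
    """Pub/Sub message bodies must be bytes, so encode the XML first."""
    return xml_string.encode('utf-8')

def publish_poslog(project_id, topic_name, xml_path):
    """Publish one POSLog XML file to a Pub/Sub topic; returns the message ID."""
    # Imported here so to_pubsub_payload stays usable without the client library.
    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project_id, topic_name)
    with open(xml_path) as f:
        future = publisher.publish(topic_path, data=to_pubsub_payload(f.read()))
    return future.result()  # blocks until the message is accepted

# Usage (hypothetical names):
# publish_poslog('my-project', 'poslog', '/data/outbox/poslog-0001.xml')
```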

2. A Function to Subscribe to the XML Data :
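A minimal subscriber sketch, again assuming the google-cloud-pubsub client library; the subscription name is a hypothetical placeholder, and the decoded XML would be handed off to the parser in the next step.

```python
def handle_message(message):
    """Callback for each Pub/Sub message: decode the XML and acknowledge it."""
    xml_text = message.data.decode('utf-8')
    message.ack()
    return xml_text

def listen(project_id, subscription_name):
    """Start a streaming pull on the subscription (blocks until cancelled)."""
    from google.cloud import pubsub_v1

    subscriber = pubsub_v1.SubscriberClient()
    path = subscriber.subscription_path(project_id, subscription_name)
    future = subscriber.subscribe(path, callback=handle_message)
    future.result()

# Usage (hypothetical names):
# listen('my-project', 'poslog-sub')
```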

3. Beautiful Soup + LXML for Parsing :

The plan is to first subscribe to the Pub/Sub topic and then run the ETL job on that stream, saving the raw file into object storage for the long run. Beautiful Soup is one of the best Python parsers for XML and HTML documents. The easiest way I navigate a parse tree with this parser is this:
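A short sketch of navigating a POSLog-style tree with Beautiful Soup, assuming the bs4 and lxml packages are installed; the XML below is a made-up example, not the real POSLog schema.

```python
from bs4 import BeautifulSoup

# A tiny, made-up POSLog-style document; the real schema is richer.
poslog_xml = """
<Transaction>
  <StoreID>1001</StoreID>
  <LineItem><Sale><ItemID>A1</ItemID><Amount>250</Amount></Sale></LineItem>
  <LineItem><Sale><ItemID>B2</ItemID><Amount>499</Amount></Sale></LineItem>
</Transaction>
"""

# The 'xml' feature tells Beautiful Soup to use the lxml XML parser,
# which keeps tag names case-sensitive.
soup = BeautifulSoup(poslog_xml, 'xml')
store_id = soup.find('StoreID').text
sales = [(s.ItemID.text, s.Amount.text) for s in soup.find_all('Sale')]
```

Here `store_id` comes out as '1001' and `sales` as a list of (item, amount) pairs, ready to be shaped into BigQuery rows.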

4. Time to Load Data Into BigQuery :

A small function showing how we can load data into BigQuery in streaming mode :P

from google.cloud import bigquery

def stream_data(dataset_id, table_id, rows_to_insert):
    """Stream a list of row dicts into an existing BigQuery table."""
    bigquery_client = bigquery.Client()
    dataset_ref = bigquery_client.dataset(dataset_id)
    table_ref = dataset_ref.table(table_id)
    table = bigquery_client.get_table(table_ref)  # fetch the table schema
    errors = bigquery_client.insert_rows(table, rows_to_insert)
    if not errors:
        print('Loaded {} rows into {}:{}'.format(
            len(rows_to_insert), dataset_id, table_id))
    else:
        print('Errors while inserting: {}'.format(errors))

5. Dashboard in DataStudio :

The beautiful result, a realtime dashboard, is here!

Searce Engineering

We identify better ways of doing things!

