Analyze ClickStream Data — Journey From DynamoDB to RedShift

Bhuvanesh
Sep 24

In any e-commerce website or application, clickstream logs play a major role in the real-time recommendation engine. Recently we worked on a solution to analyze clickstream data that is stored in DynamoDB. We exported the DynamoDB data to Redshift, ran our analytic queries on it, and then used an ML algorithm for further analytics.

Architecture for this pipeline

Current Setup:

A few app servers hosted on EC2 have the clickstream integration. The real-time data is captured by Kinesis, which pushes it into DynamoDB for permanent storage and ad-hoc querying. The structure of the data stored in DynamoDB is a little unusual: the clickstream integration adds the data type as a key for each value. MySQL is used as the OLTP database.
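
To make the flow concrete, here is a minimal, hypothetical sketch of that ingestion path using boto3. The stream name, table name, and sample item are all made up, so treat it as an illustration rather than the actual integration; it mainly shows how the low-level DynamoDB client wraps every attribute in a data-type descriptor, which is the same nested structure described below.

import json
import boto3

# Hypothetical resource names -- replace with your own.
STREAM_NAME = "clickstream"
TABLE_NAME = "clickstream_events"

kinesis = boto3.client("kinesis")
dynamodb = boto3.client("dynamodb")  # low-level client: values carry type descriptors


def publish_event(event):
    """Producer side: an app server pushes a raw clickstream event to Kinesis."""
    kinesis.put_record(
        StreamName=STREAM_NAME,
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=event["name"],
    )


def store_event(event):
    """Consumer side: persist the event in DynamoDB.
    The low-level API wraps every value in a data-type key (S, N, M),
    which matches the structure we see later in the exported data."""
    dynamodb.put_item(
        TableName=TABLE_NAME,
        Item={
            "date": {"S": event["date"]},
            "name": {"S": event["name"]},
            "products": {"M": {
                "13128": {"M": {
                    "quantity": {"S": "1"},
                    "price": {"N": "799"},
                }},
            }},
        },
    )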

DynamoDB Schema:

Let's say we are storing the information below.

  1. Date: String
  2. Name: String
  3. Products: JSON (Map)

{
  "date": "2019-04-20",
  "name": "www.mywebsite.com",
  "products": {
    "13128": {
      "quantity": "1",
      "price": "799"
    }
  }
}

For some application flexibility, instead of mapping each key to its value directly, the value is wrapped in a nested object that carries its data type:

{
  "KEY": {
    "Datatype": "value"
  }
}

{
  "date": {
    "s": "2019-04-20"
  },
  "name": {
    "s": "www.mywebsite.com"
  },
  "products": {
    "m": {
      "13128": {
        "m": {
          "quantity": {
            "s": "1"
          },
          "price": {
            "n": "799"
          }
        }
      }
    }
  }
}
  • S — String
  • N — Numeric
  • M — Map (Nested JSON)

Complexities:

  • DynamoDB is not an OLAP database, so we can't run aggregates, massive selects, or joins on it.
  • We need to clean up the schema and remove the data type from the JSON, using a proper key-value mapping ("date": "2019-04-20").
  • We need to know all the fields (columns) to create the Redshift table. Since this is NoSQL, the sample data has only 3 columns, but other records may have up to 40 fields.

ETL Pipeline:

  • AWS Data Pipeline: We can't push the DynamoDB data to Redshift directly, so we first export the data from DynamoDB to S3 in JSON format using AWS Data Pipeline.
  • Data Cleansing: As mentioned above, we need to cleanse the data and convert the Key: DataType: Value structure into a Key: Value structure. We use an EC2 instance and an ETL script to pull the data from S3 → cleanse it → upload it back to S3.
  • AWS Glue Crawler: We don't know the list of column names, so a Glue Crawler reads the JSON data and returns the list of columns (a small sketch of this step follows this list).
  • RedShift: The data is ready to load. Create a table and load the data into RedShift. Even though we still have some nested JSON, we can use RedShift's JSON extract functions on it later; for now, load the data as it is.
  • OLTP Data: To perform the analytics, we need some more data from the transactional database, so we set up another ETL pipeline using the Matillion tool.
  • Analytics: Finally, we can run our analytics queries on top of this data.
  • Machine Learning: We also used AWS SageMaker to build recommendations.
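
As a rough illustration of the crawler step, the sketch below creates and starts a Glue crawler over the cleansed S3 prefix with boto3; the crawler then infers the column list for us. The crawler, database, role, and bucket names are hypothetical.

import boto3

glue = boto3.client("glue")

# Hypothetical names -- replace with your own role, database, and bucket.
CRAWLER_NAME = "clickstream-cleansed-crawler"
GLUE_DATABASE = "clickstream_db"
GLUE_ROLE_ARN = "arn:aws:iam::123456789012:role/GlueCrawlerRole"
CLEANSED_S3_PATH = "s3://my-clickstream-bucket/cleansed/"

# Point the crawler at the cleansed JSON files in S3.
glue.create_crawler(
    Name=CRAWLER_NAME,
    Role=GLUE_ROLE_ARN,
    DatabaseName=GLUE_DATABASE,
    Targets={"S3Targets": [{"Path": CLEANSED_S3_PATH}]},
)

# Run it; the resulting tables in the Glue Data Catalog give us the column list.
glue.start_crawler(Name=CRAWLER_NAME)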

Complexities in ETL Script:

AWS Data Pipeline exported the data in JSON format, with multiple JSON lines in a single file.

{
  "date": "2019-04-20",
  "name": "www.mywebsite.com",
  "quantity": "1"
}
{
  "date": "2019-04-20",
  "name": "www.mywebsite.com",
  "quantity": "1"
}

You can see that a newline is the delimiter between JSON records. But when the whole file is read via Python's json module, it is not valid JSON; it would have to be wrapped in an array like the one below.

[
  {
    "date": "2019-04-20",
    "name": "www.mywebsite.com",
    "quantity": "1"
  },
  {
    "date": "2019-04-20",
    "name": "www.mywebsite.com",
    "quantity": "1"
  }
]

So I decided to open each file, read it line by line, treat each line as its own JSON string, and perform the cleanup on it, and it worked as we expected. Below is the Python ETL script we wrote to clean up the data and save the cleansed data into a different folder. Once the ETL script has been executed, the cleansed files are uploaded to S3, and those files are what we load into RedShift.

import os
import json

inputdir = '/opt/bhuvi/input/'
outputdir = '/opt/bhuvi/output/'


def filterJson(payload):
    """Strip the DynamoDB data-type wrappers (s/S, n/N, m/M) in place."""
    for key, value in payload.items():
        if isinstance(value, dict):
            # Unwrap "M"/"m" maps so the nested object replaces the type wrapper.
            for key1 in value.keys():
                if key1 == "M":
                    payload[key] = value["M"]
                if key1 == "m":
                    payload[key] = value["m"]
            # Recurse into the nested dict; scalar wrappers return their bare value.
            tmp = filterJson(value)
            if tmp is not None:
                payload[key] = tmp
        else:
            # A scalar wrapper like {"s": "2019-04-20"}: return the raw value.
            if key in ("s", "S", "n", "N"):
                return payload.pop(key)


for filename in os.listdir(inputdir):
    with open(inputdir + filename, 'r') as f:
        lines = f.readlines()
    for line in lines:
        # Each line is one JSON record exported by Data Pipeline.
        y = json.loads(line)
        filterJson(y)
        # Append the cleansed record to a new file in the output folder.
        with open(outputdir + 'cleansed_' + filename, 'a') as opened_file:
            opened_file.write(json.dumps(y) + "\n")

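Once the script has run, the cleansed files have to go back to S3 for the crawler and the RedShift load. A minimal sketch of that upload, assuming a hypothetical bucket name and prefix and the same output directory as above:

import os
import boto3

s3 = boto3.client("s3")

outputdir = '/opt/bhuvi/output/'
BUCKET = "my-clickstream-bucket"  # hypothetical bucket name
PREFIX = "cleansed/"              # prefix that Glue and the COPY command will read from

for filename in os.listdir(outputdir):
    # Upload each cleansed file under the chosen prefix.
    s3.upload_file(outputdir + filename, BUCKET, PREFIX + filename)
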
Complexities in identifying the list of Columns:

We don't know anything about this data in advance, so it's very hard to identify the list of column names; we even got two different schemas when we ran the Glue Crawler. Let's take a look at the two files below.

File1:

{
  "date": "2019-04-20",
  "name": "www.mywebsite.com",
  "quantity": "1"
}
-- Rest of the lines have the same structure.

File2:

{
  "date": "2019-04-20",
  "name": "www.mywebsite.com",
  "quantity": "1",
  "price": 100
}
{
  "date": "2019-04-20",
  "name": "www.mywebsite.com",
  "quantity": "1"
}

If you look at File2's first record, it has a different JSON structure (an extra price field), so the Glue crawler created two tables.

  • Table 1: Structure of File1 (3 columns)
  • Table 2: Structure of File2’s first line (4 columns)

But no worries: compare both tables and pick up the new columns that appear only in the second table. Together they give you the full list of columns; the sketch below shows one way to do the comparison.
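
If you prefer not to eyeball the two tables, a small boto3 sketch like this one (the database and table names are hypothetical) pulls both schemas from the Glue Data Catalog and prints the columns that exist only in the second table, plus the union you need for the RedShift table:

import boto3

glue = boto3.client("glue")
GLUE_DATABASE = "clickstream_db"  # hypothetical Glue database name


def columns_of(table_name):
    """Return the set of column names the crawler inferred for a table."""
    table = glue.get_table(DatabaseName=GLUE_DATABASE, Name=table_name)
    return {c["Name"] for c in table["Table"]["StorageDescriptor"]["Columns"]}


cols_file1 = columns_of("cleansed_file1")  # hypothetical table names
cols_file2 = columns_of("cleansed_file2")

# Columns present only in the second table -- the "new" columns.
print(sorted(cols_file2 - cols_file1))

# The full column list for the RedShift table is the union of both.
print(sorted(cols_file1 | cols_file2))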

Complexities in loading into RedShift (or so you may think):

  1. In our actual data we had 50+ columns; file1 had only a subset of them and file2 had 45 columns.
  2. Mapping the right JSON key to the right Redshift column.
  3. Writing a JSON path file.

But we don’t need to worry about these things.

  • The COPY command's AUTO option automatically maps each JSON key to the matching column name.
  • One file contains multiple JSON lines; Python does not treat that as a single valid JSON document, but RedShift's COPY handles it fine. A minimal sketch of the load follows below.
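
Here is a rough sketch of that COPY, run from Python via psycopg2. The cluster endpoint, credentials, table name, bucket, and IAM role are all hypothetical placeholders; the json 'auto' option does the key-to-column mapping for us.

import psycopg2

# Hypothetical connection details -- replace with your own cluster and credentials.
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="awsuser",
    password="********",
)

# COPY straight from the cleansed S3 prefix; 'auto' maps JSON keys to column names,
# and RedShift is happy with one JSON object per line.
copy_sql = """
    COPY clickstream_events
    FROM 's3://my-clickstream-bucket/cleansed/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS JSON 'auto';
"""

with conn, conn.cursor() as cur:
    cur.execute(copy_sql)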

Conclusion:

This is one of the most complex NoSQL analyses done in a relational way that we have worked on. The main focus of this blog is to export a NoSQL database into RedShift and do your analytics in a relational way. Hope you found this helpful.
