Building an end-to-end analytics pipeline using Einstein Analytics, Kinesis, Spark and Redshift

Published in Analytics Vidhya · 4 min read · Aug 15, 2020

https://bq-magazine.com/the-7-habits-of-good-data-scientists/

If you are a computer programmer or work in any tech-related industry, chances are that you search Stack Overflow for answers at least once a day.

Stack Overflow is a question and answer site for professional and enthusiast programmers. The website offers a platform for users to ask and answer questions and, through active participation, to vote questions and answers up or down.

This series aims to provide a comprehensive view of designing, building and developing an analytics/AI data pipeline for Stack Overflow using the AWS stack, and finally building a dashboard in Einstein Analytics.

Pipelines are the heart of analytics and ML, and quite often they are the hardest part of an analytics or ML problem. If you have a well-designed pipeline, half the battle is won.

Since this is going to be a long post, I have split it into 6 articles. Feel free to jump to any article that piques your interest.

Introduction to Stack Overflow and Business Requirements.
Technical Design Architecture For an Analytics Pipeline.
Data Ingestion using Kinesis Firehose and boto3.
ETL and Data Processing Using Apache Spark on AWS EMR.
Data Storage in Redshift.
Einstein Analytics Data Prep & Dashboards.

So let’s dive straight into it!

Key Steps in any Project Pipeline

Understanding Business Requirements

The first step in designing any analytics or data science project is to understand how it can deliver value to the end users.

There are two ways we can understand this:

Understand the main user groups or end users of the application.
Understand how the company makes money by catering to those user groups.

So who might the Stack Overflow users be?

Let’s Understand Our Users in a Bit More Detail

Understanding our users is critical in gathering business requirements, and UX plays a key role here. Even a well-designed pipeline is useless if it doesn’t satisfy the needs of the user.

Creating user personas is one way to guide the ideation process and understand the needs, expectations and behaviour of different users.

Personally, I have found user research and personas to be very effective in designing dashboards, and a huge life saver in terms of time and efficiency.

So let’s look at the personas developed after doing some mock user research.

I want to focus on the internal users here, because they are most likely the ones who will be using the dashboards.

However, if your pipelines are well designed, they can be scaled and reused for other use cases, such as an ML problem.

1. UX Persona of an Internal User

Photo courtesy: ThriveGlobal.com, https://www.interaction-design.org/

2. UX Persona of a Developer

Key Takeaways from UX Research

A well-designed pipeline can bring all the required data into a centralised repository, which can then feed a highly interactive visualisation.
Automatic prediction of tags can be a great way to minimise user input. This is an ML use case, and if our pipeline is well designed, it can definitely be reused for this purpose.

Summary of Our Business Requirements

Now that we have our two personas and their pain points, let us capture them in the form of user stories.

“As a growth manager, I want a way to visualise all my key metrics in an interactive manner, so that I can understand the overall engagement of users with the Stack Overflow website.”
“As a developer, I want an automatic display of tags, so that I don’t have to manually input the tags myself.”
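If it helps to keep requirements traceable as the pipeline grows, the two stories can also be stored as structured records. The sketch below is only an illustration: the UserStory class and its field names (role, goal, benefit) are assumptions made for this example, not part of any tool used later in the series.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserStory:
    """One business requirement in 'As a / I want / so that' form (hypothetical helper)."""
    role: str      # who is asking
    goal: str      # what they want
    benefit: str   # why it matters to them

    def render(self) -> str:
        # Rebuild the sentence form from its three parts.
        return f"As a {self.role}, I want {self.goal}, so that {self.benefit}."


# The two stories from this article, split into their parts.
STORIES = [
    UserStory(
        role="growth manager",
        goal="a way to visualise all my key metrics in an interactive manner",
        benefit="I can understand the overall engagement of users "
                "with the Stack Overflow website",
    ),
    UserStory(
        role="developer",
        goal="an automatic display of tags",
        benefit="I don't have to manually input the tags myself",
    ),
]

if __name__ == "__main__":
    for story in STORIES:
        print(story.render())
```

Keeping the role, goal and benefit separate makes it easy to check later that every dashboard or model in the pipeline maps back to one of these stories.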

Now, let’s understand how to conceive a technical architecture for this business requirement.

This is explained in the next article of the series, Technical Design Architecture For an Analytics Pipeline.