Big data pipeline: How to get the elephant inside the room
Transcript (with minor tweaks) of my talk on Purplle’s Serverless Data Pipeline @ AWS India Summit — 2016
Big data is certainly the elephant in the room that everyone is talking about. There is hardly any conversation in tech circles that doesn’t include it. It gives companies a competitive advantage, aids decision making and brings analytical prowess to organizations of all shapes and sizes. So it’s imperative that startups take control of their data and start treating it like a first-class citizen.
But it’s easier said than done! Whenever I meet young startup CTOs, I find that they are often bogged down by the cost, volumes and processes that come along with big data.
“It’s too early for us” is the common refrain. “We don’t have the varied skill-sets in the team required to manage this.” “We are not ready for a huge spike in operating expenditure without an immediate ROI.” “We want to finalize our big data strategy before going ahead with implementation.”
My suggestion — start storing data now, be it events, logs or click streams. The strategy for creating value from it can keep evolving. Technology has progressed to the point where it no longer takes a rocket scientist to extract information and derive actionable insights. There are amazing visualization tools and data warehouses like Redshift & BigQuery that enable people from varied skill-sets and backgrounds to become masters of data. Having said that, you need to ensure that the overheads in terms of people, operating cost and infra management stay minimal.
What if I told you that at Purplle.com we have created a highly scalable, serverless data pipeline — at unbelievably low operating costs and engineered by just one developer? It helps us collect millions of data points every day!
Purplle’s Data Pipeline — The thought process
Let’s start from the beginning — we sat down and sketched out a basic architecture of the data flow from event producers to the data lake.
- Variety — Diverse data sources: apps, web, CRM. Diverse data formats: events, chats, structured data, unstructured text.
- Velocity — Uneven capacity needs: typically millions of events a day, arriving in crests, troughs and spikes.
- Veracity — Biases, noise & abnormalities in data
We tried to define the ideal infrastructure needed for each leg of the data pipeline.
- Collectors & Routers — Ability to handle a massive influx of traffic like click-stream events & ad impressions (by buffering data in a persistent message queue, which helps absorb temporary outages or limited scale of downstream sinks). Ability to feed multiple downstream flows of real-time and batch data pipelines.
- Data lake — An ideal data lake should be abundantly redundant & durable (you wouldn’t want to lose raw data), capable of I/O at high volumes and highly available.
- Data Warehouse — It should be flexible enough to allow experimentation with data modelling. Ingestion of raw data from the data lake, no matter how frequent, should be seamless. Querying the data should be simple.
- Hot Data Tier (NoSQL/Cache) — Fast reads and writes for unit and batch operations at scale, with consistent performance under uneven traffic.
Solution — Think Serverless!
Do not reinvent the wheel. Leverage open-source & community-driven technologies. Leverage serverless cloud technologies.
We could see the benefits of using serverless technologies.
- Low cost of experimentation, agility in development.
- Pay per use — There is no need to commit to particular infra specifications.
- Highly scalable & available.
- Completely managed — We could focus on building our core product.
So, what does our pipeline look like now?
- Trackers [Kinesis SDK & API Gateway with Lambda] — The Kinesis SDK and API Gateway with Lambda are used to collect data from our apps, website, servers & CRM seamlessly. This ensures there is no data leakage due to network or connection errors, without you having to worry about managing exceptions & retries.
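To make that retry behaviour concrete, here is a minimal sketch — hypothetical names, not our production code — of how a tracker can retry a failed send with exponential backoff instead of silently dropping the event:

```javascript
// Sketch of tracker-side retries (illustrative; `sendFn` stands in for
// whatever actually ships the event, e.g. a Kinesis or API Gateway call).
async function sendWithRetry(sendFn, event, maxAttempts = 3, baseDelayMs = 100) {
  let lastError;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await sendFn(event); // success: the event is safely buffered
    } catch (err) {
      lastError = err;
      // back off exponentially: 100ms, 200ms, 400ms, ...
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** attempt));
    }
  }
  throw lastError; // caller can persist the event locally and retry later
}
```

The key point is that a transient network error never reaches the application code as a lost event; it is either absorbed by a retry or surfaced for durable local buffering.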
- Collectors [Kinesis, Lambda, Kinesis Firehose] — Kinesis is the Kafka equivalent on AWS that can buffer streaming data. Schema policing, validations & enrichers are written in Node.js and run only when Lambda is triggered from Kinesis. Firehose streams the validated data into the downstream sinks, S3 & Redshift. A copy of the data is also ingested into our real-time prediction engine & eventually DynamoDB.
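As an illustration of the schema-policing step, here is a minimal sketch of a Kinesis-triggered Lambda handler in Node.js. The event fields (`type`, `ts`) are hypothetical placeholders for a real event schema, and the handler simply returns the valid events rather than forwarding them to a sink:

```javascript
// Illustrative schema check; a real pipeline would validate far more.
function isValidEvent(evt) {
  return typeof evt.type === 'string' && typeof evt.ts === 'number';
}

// Kinesis delivers records base64-encoded under rec.kinesis.data.
function handler(kinesisEvent) {
  return kinesisEvent.Records
    .map((rec) => {
      try {
        return JSON.parse(Buffer.from(rec.kinesis.data, 'base64').toString('utf8'));
      } catch (e) {
        return null; // malformed JSON: drop (or route to a dead-letter stream)
      }
    })
    .filter((evt) => evt !== null && isValidEvent(evt));
}
```

Because Lambda only runs when records arrive on the stream, this validation layer costs nothing during quiet periods — one of the main attractions of the serverless approach.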
- Data lake [S3] — AWS S3 is a massively durable, highly available & ridiculously cheap object store. It supports on-the-fly data encryption as well. It is a perfect data lake and is widely used in the industry for this very use case.
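A common way to keep an S3 data lake cheap to query later is to partition object keys by event type and date, so batch loads only scan the prefixes they need. A small sketch — the `events/` prefix and layout are illustrative assumptions, not our exact scheme:

```javascript
// Build a date-partitioned S3 object key for a batch of events,
// e.g. events/click/2016/05/09/batch-001.json
function s3Key(eventType, date, batchId) {
  const y = date.getUTCFullYear();
  const m = String(date.getUTCMonth() + 1).padStart(2, '0');
  const d = String(date.getUTCDate()).padStart(2, '0');
  return `events/${eventType}/${y}/${m}/${d}/${batchId}.json`;
}
```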
- Data warehouse [Redshift] — AWS Redshift enables us to quickly model & query our data using standard SQL. Loading raw data into a model can be done in a few clicks using AWS Data Pipeline. It’s very powerful, cheap, and flexible: the cluster can be resized on the fly.
- NoSQL/Caching [DynamoDB] — We use DynamoDB as our hot data store. It’s a fully managed, scalable, low-latency NoSQL database.
You can find details on AWS products here.
This is, of course, a very high-level view of our architecture. It has certainly evolved & matured from where we started, as we came to understand the finer nuances. Of course, there were workflow issues to tackle, steep learning curves & process additions along the way. Today at Purplle, this data powers the knowledge engine, real-time recommendations, user persona and affinity models, pricing, merchandising, marketing channel optimization and a whole lot more. “How Purplle created value from data” is certainly a topic I would want to write about later. But to talk about the elephant in the room, you’ve got to put it in there first.
I would love to hear your feedback — please share your thoughts on this topic.
Cross-posted on LinkedIn