Data Engineering — What is the plan?

Rajesh Kumar
Urban Company – Engineering
3 min read · Dec 8, 2018

Introduction

Data Engineering is defined differently by different people.

For me, it comes down to five things:

1. Availability of data

Data = data generated within your system (events, transactional data) + data generated outside your system (3rd party data, Twitter, weather reports, news)

2. Real-time

Data should be available to you as and when it is generated.

3. Ability to build insight

A robust system that can process massively large datasets and produce insights.

4. User Self-Service

Treat your colleagues as a user base that interacts with your platform through a proper interface.

5. Engineering hallmark on data quality

Data entering your system should contain your stamp of purity.

PLAN

As soon as you start leading a data engineering team, you will be asked this question: what is your 1-year plan?

In this article, I try to answer that question.

What is your 1-year plan?

I have divided the plan into 4 execution streams and 1 output stream:

1. ACQUIRE

Embrace all data, regardless of volume, source, speed, and type.

Q1 — Ability to consume your internal events in real-time and internal database transactions at an hourly frequency.

Q2 — Ability to consume data from major databases (MySQL, Postgres, Mongo) in real-time. Dashboard to track network usage cost from each channel.

Q3 — Ability to consume 3rd party data via stream, FTP-server, feeds.

Q4 — Ability to consume all data types —text, image, audio, video.
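One way to keep these heterogeneous sources manageable is to wrap every incoming record in a common envelope at the ingestion edge, so downstream consumers see one shape regardless of origin. A minimal sketch in Python — the field names here are my own illustration, not a standard:

```python
import time
import uuid

def to_envelope(source, record, event_time=None):
    """Wrap any incoming record (stream event, DB row, 3rd-party feed)
    in a common envelope with provenance and timing metadata."""
    return {
        "event_id": str(uuid.uuid4()),
        "source": source,                      # e.g. "clickstream", "orders-db"
        "event_time": event_time or time.time(),
        "ingest_time": time.time(),
        "payload": record,
    }

# A clickstream event and a database row get the same envelope shape.
click = to_envelope("clickstream", {"user": 42, "page": "/home"})
row = to_envelope("orders-db", {"order_id": 7, "amount": 199.0})
print(click["source"], row["source"])  # clickstream orders-db
```

In practice the envelope would be produced by your collectors (Kafka producers, CDC connectors, feed pollers), but normalizing early is what makes "embrace all data" tractable later.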

2. ORGANIZE

Clean, transform, and organize data inside a Logical Data Warehouse (LDW) comprising multiple data stores.

Q1 — Ability to clean and transform stream and transactional data. Set up a columnar data store for running interactive queries.

Q2 — Ability to validate data via an enforced schema. Set up a robust, scalable, cost-effective data store (HDFS, S3) and a blazing-fast data store (Redis, Aerospike).

Q3 — Create a logical layer above your data stores. Dashboard to track storage cost from each user.

Q4 — Ability to store other data types like images, audio, and video.
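The schema-enforcement idea in Q2 can be sketched as a tiny validator. In production you would likely reach for Avro or Protobuf with a schema registry, but the principle is the same; the `SCHEMA` below is a made-up example, not a real Urban Company schema:

```python
# An enforced schema: field name -> expected Python type.
SCHEMA = {"order_id": int, "amount": float, "city": str}

def validate(record, schema):
    """Return a list of violations; an empty list means the record passes
    and may enter the warehouse with your 'stamp of purity'."""
    errors = []
    for field, ftype in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], ftype):
            errors.append(f"bad type for {field}: expected {ftype.__name__}")
    return errors

good = validate({"order_id": 1, "amount": 9.5, "city": "Pune"}, SCHEMA)
bad = validate({"order_id": "x", "city": "Pune"}, SCHEMA)
print(good)  # []
print(bad)
```

Records that fail validation typically go to a quarantine area (a dead-letter topic or bucket) rather than being dropped, so data quality issues stay visible.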

3. ANALYZE

Build a query engine that spans from Traditional Reporting to Prescriptive Analytics.

Q1 — Ability to query and analyze stream data.

Q2 — Ability to query and analyze batch data. Dashboard to track compute cost from each user.

Q3 — Single query engine to query both batch and stream data. Ability to create insights (derived tables) and schedule them to run periodically.

Q4 — Ability to access/analyze data from multiple interfaces (IPython notebook). Ability to analyze images, audio, video, and data streams.
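The "derived table" idea in Q3 is simply pre-aggregating an insight into its own table, which a scheduler then refreshes periodically. A toy sketch using SQLite — the table and column names are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (city TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("Delhi", 100.0), ("Delhi", 50.0), ("Mumbai", 80.0)],
)

# The derived table: a pre-aggregated insight that a scheduler
# (cron, Airflow, etc.) would recreate every hour or day.
conn.execute(
    """CREATE TABLE city_revenue AS
       SELECT city, SUM(amount) AS revenue
       FROM orders GROUP BY city"""
)
rows = dict(conn.execute("SELECT city, revenue FROM city_revenue"))
print(sorted(rows.items()))  # [('Delhi', 150.0), ('Mumbai', 80.0)]
```

Serving dashboards from the derived table instead of the raw events is what keeps interactive queries fast as volumes grow.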

4. DELIVER

Deliver Data and Analytics at the Optimal Point of Impact.

Q1 — Reporting, Analytics Dashboard, and Visual Exploration.

Q2 — Ability to write ad-hoc queries on Big Data.

Q3 — Ability to access insights via API or subscribe to triggers for real-time insights.

Q4 — Ability to run ML algorithms on Big Data.
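Q3's "subscribe to triggers" pattern boils down to publish/subscribe: services register interest in an insight topic and get notified whenever a fresh value arrives. A minimal in-process sketch — in a real deployment this would be Kafka, SNS, or a webhook system, and the topic name below is hypothetical:

```python
from collections import defaultdict

class InsightBus:
    """Tiny in-process pub/sub: services subscribe to an insight topic
    and are called back whenever a new insight is published."""

    def __init__(self):
        self._subs = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subs[topic].append(callback)

    def publish(self, topic, value):
        for callback in self._subs[topic]:
            callback(value)

bus = InsightBus()
alerts = []
# A downstream service reacting to real-time insights.
bus.subscribe("surge-pricing", alerts.append)
bus.publish("surge-pricing", {"city": "Delhi", "multiplier": 1.4})
print(alerts)
```

The API variant in Q3 is the pull-based twin of the same idea: the consumer queries the latest insight on demand instead of being pushed to.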

5. BUSINESS

All your efforts should translate to business value, i.e., making more money for the company.

Q1 — Ability to access events and event funnels in real time and make quick reactive decisions. An interactive dashboard for slicing and dicing data.

Q2 — Ability to derive complex insights by using large datasets.

Q3 — Ability to store insights periodically and use them in services to make more proactive decisions.

Q4 — Ability to derive insights from all data types and build more proactive systems.

Please comment on what tools and technologies we can use to achieve these goals. That's it! If you found this article helpful, please click the clap 👏 button.

REF:

1. https://www.gartner.com/binaries/content/assets/events/keywords/catalyst/catus8/2017_planning_guide_for_data_analytics.pdf
