Ten Must-Ask Questions for Data Engineering Projects

Published in 97 Things, Jul 21, 2019

Think of this as a checklist of the questions you need to ask before you estimate a delivery date or start on the design. And you definitely need to ask these questions before coding.

Question #1: What are the touch points?

Identify all the data sources that the data pipeline will read from. Also identify all the output locations and systems that will consume the data product your pipeline produces, along with any systems that contain configurations, lookups, etc.

Question #2: What are the granularities?

For a given data source, do not assume the granularity it represents based on a sample dataset. A dataset may represent a transaction, a company, a combination of both, or an aggregation at a certain level of granularity. Ask about the level of granularity of both the input data source and the output data source. For example, ask:

Does the data object represent data at the transaction level, or transactions rolled up monthly, quarterly, annually, or over a moving window?
Does the data object represent data at the customer level, for either an individual customer or a group of customers?
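One way to test a granularity assumption against a sample is to check whether the assumed key actually identifies each row. The sketch below is only an illustration: the column names (customer_id, month, amount) and the sample rows are hypothetical, not taken from any real dataset.

```python
from collections import Counter


def duplicate_keys(rows, key_columns):
    """Return the key tuples that appear more than once in rows."""
    counts = Counter(tuple(row[col] for col in key_columns) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical sample: it may look like one row per customer, but it is not.
sample = [
    {"customer_id": "C1", "month": "2019-01", "amount": 120.0},
    {"customer_id": "C1", "month": "2019-02", "amount": 80.0},
    {"customer_id": "C2", "month": "2019-01", "amount": 45.5},
]

# Assumed grain: one row per customer.
by_customer = duplicate_keys(sample, ["customer_id"])

# Assumed grain: one row per customer per month.
by_customer_month = duplicate_keys(sample, ["customer_id", "month"])
```

Here by_customer reports that C1 appears twice, so the data is not at the customer level, while by_customer_month comes back empty. A clean result on a sample still does not prove the grain, which is why the question has to be asked of the data owner.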

Question #3: What are the input and output schemas?

Ask for the schema for the input data sources and the output data sources before you start coding. Provide a sample output based on the input schema and the requirements you are given.
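As a rough sketch of what "schema plus sample output" could look like, the example below writes both schemas down as dataclasses and derives a small sample output from a few input rows. Every name in it (TransactionIn, MonthlySpendOut, the fields, the monthly roll-up rule) is an assumption made for illustration; the real schemas and requirements come from whoever owns the project.

```python
from dataclasses import dataclass, fields


@dataclass
class TransactionIn:
    """Hypothetical input schema: one row per transaction."""
    transaction_id: str
    customer_id: str
    amount: float
    transaction_date: str  # ISO date, e.g. "2019-01-05"


@dataclass
class MonthlySpendOut:
    """Hypothetical output schema: one row per customer per month."""
    customer_id: str
    month: str  # "YYYY-MM"
    total_amount: float
    transaction_count: int


def sample_output(transactions):
    """Roll transactions up to customer-month rows."""
    totals = {}
    for t in transactions:
        key = (t.customer_id, t.transaction_date[:7])
        amount, count = totals.get(key, (0.0, 0))
        totals[key] = (amount + t.amount, count + 1)
    return [MonthlySpendOut(c, m, a, n) for (c, m), (a, n) in sorted(totals.items())]


sample_in = [
    TransactionIn("T1", "C1", 100.0, "2019-01-05"),
    TransactionIn("T2", "C1", 20.0, "2019-01-20"),
    TransactionIn("T3", "C2", 45.5, "2019-02-01"),
]
sample_out = sample_output(sample_in)
output_columns = [f.name for f in fields(MonthlySpendOut)]
```

Sharing sample_out and output_columns with the requester before any pipeline code exists gives them something concrete to correct.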

Question #4: What is the algorithm?

Most of the datasets produced by the data engineering team will be fed into an algorithm that produces a calculation or prediction. Make sure you understand the inputs to that algorithm and compare them with the output dataset you are supposed to produce. At the end of the day, the output of your data pipeline must contain every input element the algorithm needs, at every stage.
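The comparison can be as simple as a set difference between what the algorithm expects and what the pipeline emits. The column names below are hypothetical and serve only to show the shape of the check.

```python
def missing_inputs(algorithm_inputs, pipeline_columns):
    """Return the algorithm inputs that the pipeline output does not provide."""
    return sorted(set(algorithm_inputs) - set(pipeline_columns))


# Hypothetical lists: what the model reads versus what the pipeline writes.
algorithm_inputs = ["customer_id", "month", "total_amount", "days_since_last_purchase"]
pipeline_columns = ["customer_id", "month", "total_amount", "transaction_count"]

gaps = missing_inputs(algorithm_inputs, pipeline_columns)
```

In this made-up case gaps contains days_since_last_purchase, a field the pipeline would have to add before the algorithm could run. A name-level check like this does not cover types, units, or granularity, so those still need a conversation.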

Question #5: Do you need backfill data?

Many algorithms use heuristics to build better predictions. During development, data scientists may focus on a smaller dataset but still expect the full history to be backfilled in production. Such requirements affect development effort, delivery date, resources, and cost.
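To make the cost of a backfill visible early, it can help to count the partitions involved. The sketch below assumes daily partitions and a made-up figure of four minutes of processing per partition; both are placeholders to be replaced with numbers from your own pipeline.

```python
from datetime import date, timedelta


def daily_partitions(start, end):
    """Yield each date from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def backfill_estimate(start, end, minutes_per_partition):
    """Return (partition count, estimated hours) for a sequential backfill."""
    partitions = sum(1 for _ in daily_partitions(start, end))
    return partitions, partitions * minutes_per_partition / 60


# Hypothetical ranges: one month used in development versus three years in production.
dev_days, dev_hours = backfill_estimate(date(2019, 6, 1), date(2019, 6, 30), 4)
full_days, full_hours = backfill_estimate(date(2016, 7, 1), date(2019, 6, 30), 4)
```

Under these assumptions the development sample covers 30 partitions and about 2 hours, while the full history covers 1,095 partitions and about 73 hours of sequential processing, which is the kind of difference that changes an estimate.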

Question #6: When is the project due date?

In many cases, the project depends on other projects, or other projects depend on it. Clarify the due date relative to those dependencies.

Question #7: Why was that due date set?

Now that you have clarified the due date, ask for the rationale behind it, as it may lead to more projects. Agreeing to a due date without understanding its impact on the projects that follow may give the false impression that you will be delivering multiple projects.

Question #8: Which hosting environment?

Ask where the data pipeline will run. Will it be hosted internally or in the cloud? Which cloud accounts, dataset permissions, and resources will you be using?

Question #9: What is the SLA?

Are the datasets produced in real time or in batches? When are they supposed to be delivered to the customer?
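Once a delivery time is agreed, it can be written down as something checkable. A minimal sketch, assuming a batch dataset with a hypothetical 06:00 delivery deadline:

```python
from datetime import datetime, timedelta


def sla_breach(delivered_at, due_at):
    """Return how late a delivery was, or timedelta(0) if it was on time."""
    return max(delivered_at - due_at, timedelta(0))


# Hypothetical deadline: the daily batch is due by 06:00.
due = datetime(2019, 7, 21, 6, 0)

on_time = sla_breach(datetime(2019, 7, 21, 5, 40), due)
late = sla_breach(datetime(2019, 7, 21, 6, 25), due)
```

Here on_time is a zero timedelta and late is 25 minutes. A real-time SLA would need a different measure, such as end-to-end latency per record, so the answer to this question determines which check makes sense.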

Question #10: Who will be taking over this project?

Many projects will be maintained by other people. Ask for their skill level and the kind of documentation they need to operate your data pipeline.