Creating a Data Pipeline with DPR

Rajat Gupta
Aug 28 · 4 min read

This is a tutorial for creating a basic data pipeline using DPR. DPR helps you source and prepare your data, validate it, see and repair any issues it finds, and then publish the result somewhere. Once this process is working to your satisfaction, automate the pipeline by adding a schedule and let it run in the background, so it’s one less thing for you to do.

In this example, we’ll take a timesheet extract and map it to a summary that you can send to your client, or at least use to check that people are booking time correctly.

1. SOURCING

The configuration in DPR, kept in a file called dpr-config.yml, could start as follows:
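(This is a sketch only; DPR’s exact keys aren’t reproduced in this post, so names like source, type, and path are illustrative assumptions.)

    source:
      name: timesheet
      type: excel                       # read from an Excel workbook
      path: ./input/timesheet.xlsx      # assumed location of the extract
      sheet: Entries                    # assumed sheet holding the raw time entries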

This reads the data from the Excel file so you can work on it in DPR. Note that you can change this to a different source (perhaps a timesheet SaaS API) and it won’t impact the rest of your pipeline.

2. PREP

Prep refers to wrangling or massaging the data into the shape you want. In our example, we want to filter to just the Client X time entries, map each individual to a department, and summarize the hours submitted by project and department.
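In dpr-config.yml, those prep steps might be sketched like this; the step names (filter, lookup, aggregate) and their fields are assumptions, not confirmed DPR syntax:

    prep:
      - filter:
          column: client
          equals: Client X                 # keep only Client X time entries
      - lookup:
          column: person
          from: ./input/departments.csv    # assumed person-to-department mapping file
          add: department                  # tag each entry with a department
      - aggregate:
          group_by: [project, department, month]
          sum: hours                       # total the submitted hours per group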

At the end of these steps, you will have a summary of the timesheet by project, by department, by month for Client X.

3. VALIDATIONS

It’s great that something that might take you 20 minutes each week can now happen in 1–2 minutes. You could stop there, but you’d rather have DPR check that things are generally OK. So let’s add some basic validations that flag an issue whenever they fail. Here, we put in checks that Engineering should have submitted at least 20 hours and certainly not more than 500 hours, and that all the time entries are mapped to a department; you might have a new person whom you haven’t yet mapped to a department.
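In the same assumed syntax, those rules might be sketched as follows; the rule names and expressions are illustrative:

    validate:
      - name: engineering-hours-in-range
        where: department == Engineering       # illustrative filter expression
        check: sum(hours) between 20 and 500   # at least 20, never more than 500
      - name: everyone-has-a-department
        check: department is not null          # catches new people not yet mapped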

You could continue to add rules as you run into issues. The key is that DPR will do the check and go no further if problems are found.

4. REPAIR

Repair is built into DPR. In this case, you might want to pull the raw data again, or you might need to tweak your data prep steps, but sometimes you can simply view the resulting data and fix it directly.

5. PUBLISH

Here, the publish step is likely just writing the data out to a file, possibly in a shared folder somewhere. The output step is classified as a publish because it runs only if the validations pass.
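A publish step for this scenario might be sketched like so; the keys and the shared-folder path are assumptions:

    publish:
      - file:
          format: excel
          path: //shared/reports/client-x-summary.xlsx   # assumed shared-folder target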

6. SCHEDULE

Scheduling is what makes all of this worthwhile. In this case, a file arrives in a certain directory, so you can have DPR wait for it in a shared folder. It could be more sophisticated, but this is a simple scenario. You no longer need to check each month that the file came on time; DPR will watch for it and run the pipeline when it arrives.
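Again in the assumed syntax, a schedule that watches for the file might look like this; trigger and watch are illustrative names:

    schedule:
      trigger: file-arrival                       # run the pipeline when the extract shows up
      watch: //shared/incoming/timesheet-*.xlsx   # assumed watched folder and file pattern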

Data pipelines are essential in today’s business world. Businesses are expected to use data to make better decisions, and putting robust pipelines in place, so that people can trust the data with minimal effort, gets them open to using it. DPR from Qvikly can help you get there.

