Data Engineering 101: A beginner's guide to data engineering best practices

Harshit Sanwal
4 min read · Jul 23, 2022


Welcome to Data Engineering 101. In today’s world, every business grows with the help of data, and many have multiplied their business several times over by using insights derived from data to enable strategic, data-driven decision making. Having the correct data has become key: extracting it from multiple sources, transforming it for its intended use, and finally loading it at the destination for use by the intended stakeholders for analysis.

To ensure the data provided to different stakeholders for analysis is as accurate as possible, data engineers follow certain best practices while creating data pipelines.

Data Quality Checks:

  • Data Type Casting- Casting every column in the dataset to its required data type ensures the right values are stored in the right column and that unwanted values are not loaded where they do not belong. For example, casting a column to an integer data type will not allow any alphabetic characters in that column (see the sketch after this list).
  • Data Volume Check- Once loading data to the destination is complete, the count of records added to the destination should be checked to make sure it exactly matches the number of records pulled from the source.
  • Duplicate Values- There should be a check for duplicate values in the dataset, and any duplicates found should be removed to maintain data integrity.
  • Change Data Capture (CDC) Alerts- These alerts notify the dataset owner when a dataset changes by more than a specified threshold percentage. They are a great tool for keeping track of changes made to datasets.
  • Missing Data Alerts- These alerts notify the dataset owner when data is not updated on a scheduled run, helping the owner check the jobs responsible for keeping that data up to date.
  • Naming Conventions- When creating a dataset, follow naming conventions that are used by everyone in the organization, so that all column and dataset names stay consistent across the data warehouse.
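
The first three checks can be automated inside the pipeline itself. Below is a minimal sketch using pandas; the file path, column names, and expected source row count are hypothetical placeholders, and your checks would use whatever extraction and loading tools your team already relies on.

```python
import pandas as pd

# Hypothetical source extract; the path and column names are placeholders.
df = pd.read_csv("orders_extract.csv")

# Data type casting: force each column to its expected type so bad values fail fast.
df = df.astype({"order_id": "int64", "amount": "float64", "customer_name": "string"})

# Data volume check: the row count loaded should match the count pulled from the source.
source_row_count = 10_000  # e.g. the result of a COUNT(*) against the source system
if len(df) != source_row_count:
    raise ValueError(f"Volume check failed: expected {source_row_count} rows, got {len(df)}")

# Duplicate check: find and drop exact duplicate rows to maintain data integrity.
duplicate_count = df.duplicated().sum()
if duplicate_count > 0:
    print(f"Dropping {duplicate_count} duplicate rows")
    df = df.drop_duplicates()
```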

Compliance Checks:

  • No Hardcoded Credentials- There should be no hard-coded credentials in scripts used in production. Instead, all credentials should be stored in environment variables and read from those variables inside the script. This helps protect the credentials from being leaked (see the sketch after this list).
  • Access Control- The data should only be accessible to the intended stakeholders. Restricting access helps protect against malicious intent and data breaches.
  • Storing scripts/files at a secure location- All the files and scripts you are working on should be saved in a secure repository shared with the team, not on the hard drive of your computer.
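
Reading credentials from environment variables takes only a few lines. Here is a minimal sketch in Python; the variable names, database driver (psycopg2), and connection details are assumptions for illustration, not part of the original setup.

```python
import os

import psycopg2  # assumed Postgres driver; swap in whatever client your warehouse uses

# Credentials come from environment variables, never from the script itself.
# The variable names below are placeholders; use whatever your team standardizes on.
db_user = os.environ["WAREHOUSE_USER"]
db_password = os.environ["WAREHOUSE_PASSWORD"]
db_host = os.environ.get("WAREHOUSE_HOST", "localhost")

conn = psycopg2.connect(
    host=db_host,
    user=db_user,
    password=db_password,
    dbname="analytics",  # hypothetical database name
)
```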

Job Scheduling:

  • Jobs Concurrency- The job schedule should be designed so that jobs running concurrently are backed by enough compute capacity and do not overload the infrastructure, which can lead to job failures and unavailable data.
  • Job Dependencies- All jobs that depend on other jobs completing before they can start should be documented, so it is easy to identify which upstream job to monitor for the successful completion of the whole chain (see the sketch after this list).
  • Failure Alerts- These alerts notify the job owner when a job fails, so the owner can re-run the job or debug the issue and still deliver the data on time.
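
Orchestrators make these practices explicit in code. Below is a minimal sketch assuming Apache Airflow 2.x; the DAG name, schedule, task commands, concurrency cap, and alert email are all hypothetical, and your team may use a different scheduler entirely.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# Failure alerts and retries: email the owner and retry once before giving up.
default_args = {
    "owner": "data-engineering",
    "retries": 1,
    "retry_delay": timedelta(minutes=10),
    "email": ["owner@example.com"],  # hypothetical alert recipient
    "email_on_failure": True,
}

with DAG(
    dag_id="daily_sales_pipeline",   # hypothetical pipeline name
    schedule_interval="0 2 * * *",   # run daily at 02:00
    start_date=datetime(2022, 7, 1),
    catchup=False,
    max_active_tasks=4,              # cap concurrent tasks so jobs don't overload the infrastructure
    default_args=default_args,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="python extract.py")
    transform = BashOperator(task_id="transform", bash_command="python transform.py")
    load = BashOperator(task_id="load", bash_command="python load.py")

    # Dependencies documented in code: transform waits for extract, load waits for transform.
    extract >> transform >> load
```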

Conclusion:

We have covered the main areas of data engineering best practices. There can be even more practices as you work with different tools and data warehouses, especially on cloud platforms, but the ones covered here will always help you provide the right data. I hope you all liked it. Any suggestions and feedback are much appreciated. Thanks!

End Notes:

If you liked this article, please give it a clap and follow me for more such articles. Thanks 😄
