Data Engineering Ready Reckoner

Published in 10xtd, Sep 29, 2021

Curated by Nitesh Mishra

Introduction to the role

Data engineers build and maintain data systems. They construct datasets that are easy to analyze and that support company requirements. They implement methods to improve data reliability and quality, and they combine raw information from different sources into consistent, machine-readable formats. They also develop and test architectures that enable data extraction and transformation for predictive or prescriptive modeling.

Data engineers use tools like SQL and Python to make data ready for data scientists. They work with data scientists to understand the specific needs of a job, then build data pipelines that source the data and transform it into the structures needed for analysis. These pipelines must be well engineered for performance and reliability, which requires a strong understanding of software engineering best practices. Data engineers also use monitoring and logging to help ensure reliability, and they must design for performance and scalability to handle large datasets and demanding SLAs. Data engineering makes data scientists more productive by letting them focus on what they do best: performing analysis. Without data engineering, data scientists spend the majority of their time preparing data for analysis.

Responsibilities

Analyze and organize raw data
Build data systems and pipelines
Evaluate business needs and objectives
Interpret trends and patterns
Conduct complex data analysis and report on results
Prepare data for prescriptive and predictive modeling
Build algorithms and prototypes
Combine raw information from different sources
Explore ways to enhance data quality and reliability
Identify opportunities for data acquisition
Develop analytical tools and programs
Collaborate with data scientists and architects on several projects

Data engineering helps make data more useful and accessible for consumers of data. To do so, data engineers must source, transform and analyze data from each system. For example, data stored in a relational database is managed as tables, much like a Microsoft Excel spreadsheet. Each table contains many rows, and all rows in a table have the same columns. A given piece of information, such as a customer order, may be stored across dozens of tables.
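To make the point about one order living in several tables concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, columns and sample values are hypothetical and chosen only for illustration; a real system would likely have many more tables.

```python
import sqlite3


def build_order_db():
    """Create an in-memory database where one order spans three tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (
            order_id INTEGER PRIMARY KEY,
            customer_id INTEGER REFERENCES customers(customer_id)
        );
        CREATE TABLE order_items (
            order_id INTEGER REFERENCES orders(order_id),
            product TEXT,
            quantity INTEGER,
            unit_price REAL
        );
        -- Hypothetical sample data
        INSERT INTO customers VALUES (1, 'Asha');
        INSERT INTO orders VALUES (100, 1);
        INSERT INTO order_items VALUES (100, 'keyboard', 1, 40.0);
        INSERT INTO order_items VALUES (100, 'mouse', 2, 15.0);
        """
    )
    return conn


def order_summary(conn, order_id):
    """Join the three tables to rebuild a single view of one order."""
    return conn.execute(
        """
        SELECT c.name,
               COUNT(*) AS line_count,
               SUM(i.quantity * i.unit_price) AS total
        FROM orders AS o
        JOIN customers AS c ON c.customer_id = o.customer_id
        JOIN order_items AS i ON i.order_id = o.order_id
        WHERE o.order_id = ?
        GROUP BY o.order_id, c.name
        """,
        (order_id,),
    ).fetchone()
```

Calling order_summary(build_order_db(), 100) returns the customer name, the number of order lines and the order total as one row, which is the kind of consolidated view a consumer of the data usually wants.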

Technology it entails

Key Data Engineering Skills and Tools

Data engineers use specialized tools to work with data. Each system presents specific challenges: engineers must consider how data is modeled, stored, secured and encoded, and must understand the most efficient ways to access and manipulate it. Data engineers think of the end-to-end process as “data pipelines.” Each pipeline has one or more sources and one or more destinations. Within the pipeline, data may undergo several steps such as transformation, validation, enrichment and summarization. Data engineers create these pipelines with a variety of technologies such as:

1. ETL Tools. Extract, Transform, Load (ETL) is a category of technologies that move data between systems. These tools access data from many different technologies, then apply rules to “transform” and cleanse the data so that it is ready for analysis. For example, an ETL process might extract the postal code from an address field and store this value in a new field so that analysis can easily be performed at the postal code level. The data is then loaded into a destination system for analysis. Examples of ETL products include Informatica and SAP Data Services.

2. SQL. Structured Query Language (SQL) is the standard language for querying relational databases. Data engineers use SQL to perform ETL tasks within a relational database. SQL is especially useful when the data source and destination are the same type of database. SQL is very popular and well-understood by many people and supported by many tools.

3. Python. Python is a general purpose programming language. It has become a popular tool for performing ETL tasks due to its ease of use and extensive libraries for accessing databases and storage technologies. Python can be used instead of ETL tools for ETL tasks. Many data engineers use Python instead of an ETL tool because it is more flexible and more powerful for these tasks.

4. Spark and Hadoop. Spark and Hadoop work with large datasets on clusters of computers. They make it easier to apply the power of many computers working together to perform a job on the data. This capability is especially important when the data is too large to be stored on a single computer. Today, Spark and Hadoop are not as easy to use as Python, and there are far more people who know and use Python.

5. HDFS and Amazon S3. Data engineers use HDFS or Amazon S3 to store data during processing. HDFS is a distributed file system and Amazon S3 is an object storage service; both can store an essentially unlimited amount of data, making them useful for data science tasks. They are also inexpensive, which is important because processing generates large volumes of data. Finally, these storage systems are integrated into the environments where the data will be processed, which makes managing data systems much easier.
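The postal code example from the ETL item above can be sketched as a small Python pipeline with separate extract, transform and load steps. This is an illustrative sketch, not how any particular ETL product works: the sample records are made up, the regular expression assumes a five-digit postal code at the end of the address, and an in-memory SQLite database stands in for the destination system.

```python
import re
import sqlite3

# Assumption: addresses end with a five-digit postal code.
POSTAL_CODE = re.compile(r"\b(\d{5})\b\s*$")


def extract():
    """Extract: return raw records (hypothetical sample data)."""
    return [
        {"customer": "Asha", "address": "12 Main St, Springfield, IL 62704"},
        {"customer": "Ben", "address": "9 Oak Ave, Portland, OR 97205"},
        {"customer": "Chen", "address": "address unknown"},
    ]


def transform(records):
    """Transform: copy the postal code out of the address into its own field."""
    transformed = []
    for record in records:
        match = POSTAL_CODE.search(record["address"])
        postal_code = match.group(1) if match else None
        transformed.append({**record, "postal_code": postal_code})
    return transformed


def load(records, conn):
    """Load: write the transformed records to the destination table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers "
        "(customer TEXT, address TEXT, postal_code TEXT)"
    )
    conn.executemany(
        "INSERT INTO customers VALUES (:customer, :address, :postal_code)",
        records,
    )
    conn.commit()


def run_pipeline(conn):
    """Run the three steps in order: extract, transform, load."""
    load(transform(extract()), conn)
```

After run_pipeline(sqlite3.connect(":memory:")), analysis at the postal code level becomes a plain SQL query on the postal_code column. Records whose address does not match the assumed pattern are loaded with an empty postal code rather than dropped, so they can be reviewed later.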

New data technologies emerge frequently, often delivering significant performance, security or other improvements that let data engineers do their jobs better. Many of these tools are licensed as open source software. Open source projects allow teams across companies to collaborate easily on software, and to use these projects with no commercial obligations. Since the early 2000s, many of the largest companies that specialize in data, such as Google and Facebook, have created critical data technologies and released them to the public as open source projects.

Desired persona

Requirements

Previous experience as a data engineer or in a similar role
Technical expertise with data models, data mining, and segmentation techniques
Knowledge of programming languages (e.g. Python)
Hands-on experience with SQL database design
Great numerical and analytical skills
Degree in Computer Science, IT, or similar field; a Master’s is a plus
Data engineering certification is a plus

Experience working with data stored in a NoSQL database such as MongoDB is mandatory. MongoDB manages data as documents, which are more like Word documents than spreadsheet tables. The candidate should be able to query relational databases using SQL and to query MongoDB, whose query language is very different from SQL. They should be able to work with both types of systems, as well as many others, so that consumers can use all the data together without having to master the intricacies of each technology.

For these reasons, even simple business questions can require complex solutions. Working with each system requires understanding the technology as well as the data. Once data engineering has sourced and curated the data for a given job, the data is much easier for consumers to use.
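The difference between document-style and table-style data can be illustrated with a short sketch. A plain Python dictionary stands in for a document here; no MongoDB API is used, and the field names and values are hypothetical. The function flattens one nested order document into the flat rows a relational table or a spreadsheet would hold.

```python
# A nested, document-style record (hypothetical fields and values).
order_document = {
    "order_id": 100,
    "customer": {"name": "Asha", "city": "Pune"},
    "items": [
        {"product": "keyboard", "quantity": 1},
        {"product": "mouse", "quantity": 2},
    ],
}


def flatten_order(doc):
    """Turn one nested order document into one flat row per item."""
    return [
        {
            "order_id": doc["order_id"],
            "customer_name": doc["customer"]["name"],
            "product": item["product"],
            "quantity": item["quantity"],
        }
        for item in doc["items"]
    ]
```

Calling flatten_order(order_document) yields two rows that share the same order_id and customer_name. Doing this kind of reshaping once, inside the pipeline, is what spares consumers from learning each storage technology.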

Guidelines for drafting or reviewing the JD for a data engineer

A data engineer organizes data to make it easy for other systems and people to use. They should be able to work with many different consumers of data, such as:

Data analysts: who answer specific questions about data, or build reports and visualizations so that other people can understand the data more easily
Data scientists: who answer more complex questions than data analysts do. For example, a data scientist might build a model that predicts which customers are likely to purchase a specific item
Systems architects: who are responsible for pulling data into the applications they build. For example, an e-commerce store might offer discounts depending on a user’s purchase history, and the infrastructure for calculating that discount is built by a systems architect
Business leaders: who need to understand what the data means and how others will use it

Data engineering works with each of these groups and must understand their specific needs. Responsibilities include:

Gathering data requirements, such as how long the data needs to be stored, how it will be used and what people and systems need access to the data.
Maintaining metadata about the data, such as what technology manages the data, the schema, the size, how the data is secured, the source of the data and the ultimate owner of the data.
Ensuring security and governance for the data, using centralized security controls like LDAP, encrypting the data, and auditing access to the data.
Storing the data, using specialized technologies that are optimized for the particular use of the data, such as a relational database, a NoSQL database, Hadoop, Amazon S3 or Azure Blob Storage.
Processing data for specific needs, using tools that access data from different sources, transform and enrich the data, summarize the data and store the data in the storage system.
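The requirements and metadata items listed above can be captured in a simple record kept alongside each dataset. The sketch below is one possible shape, not a standard: the class name, its fields and the two helper methods are hypothetical and mirror only the items named in the list (technology, schema, size, security, source, owner, retention and access).

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class DatasetMetadata:
    """Hypothetical metadata record for one dataset."""

    name: str
    technology: str      # what manages the data, e.g. "relational database"
    schema: dict         # column name -> type
    size_bytes: int
    security: str        # how the data is secured, e.g. "encrypted at rest"
    source: str          # where the data comes from
    owner: str           # ultimate owner of the data
    retention_days: int  # how long the data needs to be stored
    allowed_readers: set = field(default_factory=set)

    def can_read(self, principal):
        """Return True if the person or system has been granted access."""
        return principal in self.allowed_readers

    def is_expired(self, created_on, today):
        """Return True once the retention period has passed."""
        return today > created_on + timedelta(days=self.retention_days)
```

A record like this gives access checks and retention jobs a single place to look up the answers gathered during requirements work.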

Basic assessment

To address these responsibilities, data engineers perform many different tasks. Some examples include:

Acquisition: Sourcing the data from different systems
Cleansing: Detecting and correcting errors
Conversion: Converting data from one format to another
Disambiguation: Interpreting data that has multiple meanings
De-duplication: Removing duplicate copies of data

These tasks are growing in importance for several reasons. Data is more valuable to companies and is used across more business functions: sales, marketing, finance and other areas of the business are using data to be more innovative and more effective. Companies are also finding more ways to benefit from data. They use it to understand the current state of the business, predict the future, model their customers, prevent threats and create new kinds of products, and data engineering is the linchpin in all these activities. Finally, the technologies used for data are more complex. Most companies today create data in many systems and use a range of different technologies for their data, including relational databases, Hadoop and NoSQL.
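Two of the tasks listed above, cleansing and de-duplication, can be sketched in a few lines of Python. The record fields (name and email) and the cleansing rules are assumptions made for the example; real rules depend on the data and the business.

```python
def cleanse(record):
    """Cleansing: normalize whitespace and letter case in a contact record."""
    return {
        "name": " ".join(record["name"].split()).title(),
        "email": record["email"].strip().lower(),
    }


def deduplicate(records, key="email"):
    """De-duplication: keep the first record seen for each key value."""
    seen = set()
    unique = []
    for record in records:
        if record[key] not in seen:
            seen.add(record[key])
            unique.append(record)
    return unique


# Hypothetical raw input: the first two rows describe the same person.
raw_records = [
    {"name": "  asha  rao", "email": "Asha@Example.com "},
    {"name": "Asha Rao", "email": "asha@example.com"},
    {"name": "ben li", "email": "ben@example.com"},
]

clean_records = deduplicate([cleanse(r) for r in raw_records])
```

Cleansing runs before de-duplication on purpose: the two records for the same person only become recognizable as duplicates once their email addresses have been normalized.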

Sources:
1. https://www.coursera.org/articles/what-does-a-data-engineer-do-and-how-do-i-become-one

2. https://www.analyticsvidhya.com/blog/2018/11/data-engineer-comprehensive-list-resources-get-started/

3. https://towardsdatascience.com/introduction-to-data-engineering-e16c9942dc2c

4. https://www.dataquest.io/blog/data-analyst-data-scientist-data-engineer/


Written by 10XTD