NYC Payroll Data Integration Pipeline Project in Azure

Nevenka Lukic
5 min read · Oct 19, 2023

The moment has arrived: I have finished the final project of the Udacity Data Engineer in Azure course, and it was a truly transformative experience. This final undertaking marks a remarkable milestone in my learning journey.

The project holds huge value for beginners, as it contains all the essential elements of a real-life, practical application. Nearly everything I learned during the course came together in this one comprehensive project. It not only deepened my understanding but also significantly boosted my self-confidence, a quality I lacked as a novice in the field.

You can find the code related to this project in my GitHub repository.

Project Introduction

The City of New York envisions a robust Data Analytics platform powered by Azure Synapse Analytics. This platform will address two primary objectives:

Financial Resource Allocation Analysis: Analyzing how the City’s financial resources are allocated, with a particular focus on understanding the portion of the budget dedicated to overtime.

Transparency and Public Accessibility: Making data available to the public, shedding light on how the City’s budget is spent on salaries and overtime pay for all municipal employees.

Project Overview

This project involves six essential steps to achieve the desired goals:

Step 1: Prepare the Data Infrastructure

In laying the groundwork for the data processing pipeline, the initial steps involved:

  • Creation and configuration of Azure Data Lake Storage Gen2 for efficient data storage.
  • Setup of an Azure SQL Database to house the current payroll data securely.
  • Creation of an Azure Data Factory resource to orchestrate the data workflows.
  • Establishment of a Synapse Analytics workspace as a robust environment for data processing and analytics.

These foundational steps set the stage for a cohesive and effective data processing pipeline; a condensed sketch of the provisioning code is shown below.
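For readers who prefer code over portal clicks, here is a minimal provisioning sketch using the Azure SDK for Python. It covers only the storage account and the Data Factory (the SQL Database and Synapse workspace follow the same pattern), and every name, the region, and the subscription ID are placeholders rather than values from my repository.

```python
# Minimal provisioning sketch using the Azure SDK for Python
# (azure-identity, azure-mgmt-storage, azure-mgmt-datafactory).
# All names and IDs below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import StorageAccountCreateParameters, Sku
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

subscription_id = "<subscription-id>"
rg, region = "nyc-payroll-rg", "eastus"
credential = DefaultAzureCredential()

# Data Lake Storage Gen2 is a StorageV2 account with the
# hierarchical namespace (is_hns_enabled) turned on.
storage = StorageManagementClient(credential, subscription_id)
storage.storage_accounts.begin_create(
    rg,
    "nycpayrolladls",
    StorageAccountCreateParameters(
        sku=Sku(name="Standard_LRS"),
        kind="StorageV2",
        location=region,
        is_hns_enabled=True,
    ),
).result()

# The Data Factory that will orchestrate the whole pipeline.
adf = DataFactoryManagementClient(credential, subscription_id)
adf.factories.create_or_update(rg, "nyc-payroll-adf", Factory(location=region))
```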

Step 2: Create Linked Services

Establishing the necessary connections to data sources and destinations:

  • Creation of linked services for Azure Data Lake, facilitating efficient access and management of data.
  • Setting up linked services for SQL Database, enabling smooth interaction and data retrieval.
  • Establishing linked services for Synapse Analytics, ensuring a seamless flow of data to and from the analytics workspace.

Together, these linked services form the backbone of the data processing pipeline: they define how Data Factory connects and authenticates to each source and destination. A sketch of creating them programmatically is shown below.
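In the project I created the linked services through the Data Factory Studio UI; the snippet below is the programmatic equivalent, continuing from the Step 1 sketch (it reuses `adf` and `rg`). Connection strings, URLs, and keys are placeholders, and in a real deployment the secrets belong in Azure Key Vault.

```python
# Sketch of the three linked services. `adf` and `rg` come from
# the Step 1 snippet; every credential below is a placeholder.
from azure.mgmt.datafactory.models import (
    LinkedServiceResource,
    AzureBlobFSLinkedService,
    AzureSqlDatabaseLinkedService,
    AzureSqlDWLinkedService,
    SecureString,
)

factory = "nyc-payroll-adf"

# Azure Data Lake Storage Gen2
adf.linked_services.create_or_update(
    rg, factory, "ls_adls",
    LinkedServiceResource(properties=AzureBlobFSLinkedService(
        url="https://nycpayrolladls.dfs.core.windows.net",
        account_key=SecureString(value="<storage-account-key>"),
    )),
)

# Azure SQL Database holding the current payroll data
adf.linked_services.create_or_update(
    rg, factory, "ls_sqldb",
    LinkedServiceResource(properties=AzureSqlDatabaseLinkedService(
        connection_string=SecureString(value="<sql-db-connection-string>"),
    )),
)

# Synapse Analytics dedicated SQL pool (the type name is the legacy "SQL DW")
adf.linked_services.create_or_update(
    rg, factory, "ls_synapse",
    LinkedServiceResource(properties=AzureSqlDWLinkedService(
        connection_string=SecureString(value="<synapse-connection-string>"),
    )),
)
```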

Step 3: Create Datasets in Azure Data Factory

Creating structured datasets for seamless data movement and transformation:

  • Development of datasets specifically tailored for payroll files, streamlining their integration into the data processing pipeline.
  • Structuring datasets for master data, optimizing its movement and transformation within the system.
  • Crafting datasets for transactional data, ensuring efficient processing and transformation as it moves through various stages.

Structured datasets play a critical role in enhancing the flow and management of data throughout the process; a sketch of two representative dataset definitions is shown below.
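As an illustration, here is a hedged sketch of two of these dataset definitions: a delimited-text dataset for a payroll CSV in the data lake, and a table dataset in the SQL Database. The file, folder, and table names are assumptions chosen for the example, not values copied from my repository.

```python
# Two representative dataset definitions, continuing from the
# earlier snippets. Paths and table names are illustrative only.
from azure.mgmt.datafactory.models import (
    DatasetResource,
    DelimitedTextDataset,
    AzureBlobFSLocation,
    AzureSqlTableDataset,
    LinkedServiceReference,
)

factory = "nyc-payroll-adf"

# A payroll CSV file landed in the data lake
adf.datasets.create_or_update(
    rg, factory, "ds_payroll_csv",
    DatasetResource(properties=DelimitedTextDataset(
        linked_service_name=LinkedServiceReference(
            type="LinkedServiceReference", reference_name="ls_adls"),
        location=AzureBlobFSLocation(
            file_system="payroll",
            folder_path="raw",
            file_name="nycpayroll_2021.csv",
        ),
        column_delimiter=",",
        first_row_as_header=True,
    )),
)

# The corresponding table in the Azure SQL Database
adf.datasets.create_or_update(
    rg, factory, "ds_payroll_sql",
    DatasetResource(properties=AzureSqlTableDataset(
        linked_service_name=LinkedServiceReference(
            type="LinkedServiceReference", reference_name="ls_sqldb"),
        table_name="dbo.NYC_Payroll_Data",
    )),
)
```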

Step 4: Create Data Flows

Designing data flows to efficiently move and load data across different destinations:

  • Careful crafting of data flows, ensuring seamless loading of payroll data into both SQL Database and Synapse Analytics.

These data flows do the heavy lifting of moving and loading the payroll data into both destinations. A sketch of how a mapping data flow is represented as a resource is shown below.
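Mapping data flows are normally authored visually in Data Factory Studio, which generates a transformation script under the hood. Purely as an illustration of the resource shape, here is a skeleton of one data flow with that script elided; the names are mine, not the project's.

```python
# Skeleton of a mapping data flow resource. The transformation
# script is generated by the visual designer and is elided here.
from azure.mgmt.datafactory.models import (
    DataFlowResource,
    MappingDataFlow,
    DataFlowSource,
    DataFlowSink,
    DatasetReference,
)

adf.data_flows.create_or_update(
    rg, "nyc-payroll-adf", "df_payroll_to_sql",
    DataFlowResource(properties=MappingDataFlow(
        sources=[DataFlowSource(
            name="PayrollSource",
            dataset=DatasetReference(
                type="DatasetReference", reference_name="ds_payroll_csv"),
        )],
        sinks=[DataFlowSink(
            name="SqlSink",
            dataset=DatasetReference(
                type="DatasetReference", reference_name="ds_payroll_sql"),
        )],
        script="...",  # left elided; the visual designer fills this in
    )),
)
```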

Step 5: Data Aggregation and Parameterization

Preparing the data for meaningful analytics:

  • Aggregating and summarizing data to derive key insights.
  • Parameterizing the pipeline and data flows so that new payroll data can be processed without manual rework.

These two steps are pivotal in preparing the data for in-depth analysis, allowing for the extraction of valuable insights that can drive informed decision-making. A sketch of the kind of aggregation involved is shown below.
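To make the aggregation concrete, here is a hedged sketch of the kind of summary query this step produces, run against the Synapse SQL pool from Python. The table and column names (`AgencyName`, `RegularGrossPaid`, `TotalOTPaid`, and so on) are assumptions modelled on the public NYC payroll schema rather than values from my repository.

```python
# Illustrative aggregation: total pay per agency and fiscal year,
# with overtime broken out. Schema names are assumptions based on
# the public NYC payroll data; the connection string is a stub.
import pyodbc

conn = pyodbc.connect("<synapse-odbc-connection-string>")
rows = conn.cursor().execute("""
    SELECT AgencyName,
           FiscalYear,
           SUM(TotalOTPaid)                    AS TotalOvertimePaid,
           SUM(RegularGrossPaid + TotalOTPaid) AS TotalPaid
    FROM   dbo.NYC_Payroll_Data
    GROUP  BY AgencyName, FiscalYear
    ORDER  BY FiscalYear, AgencyName
""").fetchall()

for agency, year, overtime, total in rows:
    print(f"{year} {agency}: {overtime:,.2f} overtime of {total:,.2f} total")
```

Parameterization works in the same spirit: a pipeline parameter (for example, a fiscal-year value) is passed down into the data flow, so a single definition can serve every load.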

Step 6: Connect the Project to GitHub

Implementing version control and collaboration using GitHub:

  • Connecting the project seamlessly to a GitHub repository, enhancing collaboration capabilities and enabling effective tracking of changes and contributions.

This integration keeps project development organized and makes tracking changes and collaborating straightforward. A sketch of configuring the repository connection programmatically is shown below.
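I wired the factory to GitHub through the "Set up code repository" option in Data Factory Studio; the snippet below is a hedged sketch of the programmatic equivalent, reusing the variables from the Step 1 sketch. The account and repository names are placeholders.

```python
# Sketch of attaching the factory to a GitHub repository,
# mirroring the "Set up code repository" dialog in ADF Studio.
# `adf`, `subscription_id`, `rg`, and `region` come from Step 1.
from azure.mgmt.datafactory.models import (
    FactoryRepoUpdate,
    FactoryGitHubConfiguration,
)

adf.factories.configure_factory_repo(
    region,
    FactoryRepoUpdate(
        factory_resource_id=(
            f"/subscriptions/{subscription_id}/resourceGroups/{rg}"
            "/providers/Microsoft.DataFactory/factories/nyc-payroll-adf"
        ),
        repo_configuration=FactoryGitHubConfiguration(
            account_name="<github-account>",
            repository_name="<repository-name>",
            collaboration_branch="main",
            root_folder="/",
        ),
    ),
)
```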

Conclusion

Embarking on this final project felt like stepping into the grand finale of a transformative learning journey. It represented a culmination of knowledge, skills, and determination, and it highlighted the value the Udacity Data Engineer in Azure course brings to aspiring professionals.

The comprehensive nature of the project, covering the critical elements of real-world applications, showcased what the course was really about. It helped me become proficient with Azure technologies and boosted my self-confidence, an indispensable trait for anyone starting a career in data engineering.

I hope that the code and learnings from this project, which are now housed in my GitHub repository, will inspire others and help them to achieve their goals. When I look back on this journey, I can see now that each project was not just a milestone — each step of this journey was also a foundation for the extraordinary possibilities that lie ahead.

If you liked this blog, you can follow me on Twitter/X to get notified about future posts, or subscribe to my newsletter for free.

