First Data Engineering Project: Uber Data Analysis End-to-End Project

Nisha Sreedharan
11 min read · Jun 12, 2023


As a professional data engineer, I love the challenge of taking raw data from different sources and working hard to clean, transform and beautify it so that it can be turned into a usable format by business or data analysts. However, working in an organisation as a data engineer can lead to certain lean phases when the projects or challenges are just not enough to get that adrenaline pumping. That’s when I was fortunate enough to come across the channel of one Mr. Darshil Parmar. His fantastic channel helped me work on a great data engineering project with Uber Data Analysis (https://www.youtube.com/watch?v=WpQECq5Hx9g). As a fellow data engineer, Darshil’s channel has really helped me get more focussed on my projects and also learn a lot about cloud platforms like AWS and GCP. I followed Darshil’s most recent video about Uber Data Analysis and my weekend went from ‘meh’ to ‘oh yeahhhh’. I’ll walk through the project and the different components I used in it. Before starting, it’s important to understand that this project is not for an absolute beginner in data engineering. The expectation is that you already have a basic understanding of cloud technologies, SQL, Python programming and some idea about data modelling and ETL. If any of these terms seem foreign and scary, please do reach out to me and I will be happy to help you get started with the basics.

The data we shall use for analysis is very similar to Uber data, but obviously Uber can’t even get its drivers to behave well, so they are definitely not going to give us their actual data! So the data set is similar to Uber’s, but not quite. The data set we use is the yellow and green taxi trip data from NYC’s government website — https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page. In this project we take data from this website, store it in Google Cloud Storage, and then use Mage for our ETL pipeline. Through Mage we’ll push our project’s data into Google BigQuery. And as a final step, we’ll create our dashboard on Looker Studio to visualise the data we transformed and analysed.

End to End Data Flow for the Project

Let me start by explaining the different technologies listed above. GCP (Google Cloud Platform) is Google’s suite of cloud computing products and services, provided for various use cases. The GCP services that we shall use are:

  1. Google Cloud Storage: It’s an online file storage service provided by GCP. It helps us store and retrieve files from anywhere in the cloud with an internet connection.
  2. Google Compute Engine: It’s the part of the GCP suite that lets us run virtual machines for our applications. It’s easy to create, run and maintain applications on GCP’s Compute Engine.
  3. BigQuery: It’s a data warehouse provided by Google which helps us store and analyse large-scale data sets using a SQL-like interface and query language. It’s cost effective and highly scalable based on our data size and requirements.
  4. Looker Studio: Looker Studio is a web-based BI tool used for visualisation and reporting. It can take data from multiple sources, including Google Sheets, BigQuery etc., to create interactive dashboards that turn our data into great charts and enhance the readability of complex datasets.

Apart from the GCP services, the other tools and technologies used for this project are:

  1. Jupyter Notebooks: Jupyter Notebooks is a web-based interactive platform on which we can write and execute code snippets. It runs notebook documents via the web browser and is a great place to test code before putting it into production.

You can follow this link to install Jupyter Notebook on your system — https://jupyter.org/install. If you are on a Mac, here are the two commands you need to start it (assuming you have the latest version of pip installed on your laptop):

pip install notebook

jupyter notebook

Jupyter Notebook Console

2. Mage: Mage is one of the newest open-source tools for setting up your ETL pipeline. It helps you focus only on your business logic by providing existing templates to ‘load’, ‘transform’ and ‘export’ the data. I love its tagline — ‘Data plumbing without the shit’. https://www.mage.ai/

Mage.ai

3. Lucidchart: Lucidchart is diagramming software which helps you create design diagrams, flow charts etc. for your project. This visual tool is helpful when we want to explain the end-to-end flow of a project or module to someone who may or may not speak the tech terminology. The free version lets you create up to three diagrams — https://www.lucidchart.com/.

4. Data Modelling: Data modelling is the process of creating a visual representation of the data in the entire system, or parts of it, to understand the connections between data elements. While creating data models it’s imperative to understand fact and dimension tables.

FACT TABLE: This type of table contains the quantitative measures that we use for analysis. E.g. if we look at school data, the number of students, total classes, total number of teachers, total number of departments etc. can be stored in a fact table. Simply put, it’s a table which holds the numbers for the metrics.

DIMENSION TABLE: This type of table contains details/descriptions of the attributes of the data. E.g. for students, we would have columns like student name, date of birth, address etc. in this type of table. A small illustration follows the figure below.

Fact vs Dimension table
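To make the distinction concrete, here is a minimal pandas sketch using made-up school data (the column names are purely illustrative and not part of this project):

import pandas as pd

# Dimension table: descriptive attributes of each student (illustrative data)
student_dim = pd.DataFrame({
    'student_id': [1, 2],
    'student_name': ['Asha', 'Ravi'],
    'date_of_birth': ['2001-04-12', '2000-11-03'],
})

# Fact table: quantitative measures, referencing the dimension via student_id
attendance_fact = pd.DataFrame({
    'student_id': [1, 2],
    'classes_attended': [42, 38],
    'total_classes': [45, 45],
})

# Analysis joins the fact to the dimension to put names against the numbers
print(attendance_fact.merge(student_dim, on='student_id'))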

5. Data Dictionary: This contains a comprehensive list of all data elements and their descriptions. In any big organisation, as the number of tables grows, tracking the data can become difficult, so a data dictionary helps consolidate all data points and their descriptions. This is highly useful when we build our data models. For this project the data dictionary can be found here — https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf.

Data Dictionary

Now that the basic terminology is sorted, we can move on to the project execution. Ideally we should start with data modelling, since that’s the foundation of any data project. We can download the NYC taxi record data into our Jupyter notebook to analyse it in a tabular format. The data for all years is available on the website in several different formats.

DATA MODELLING:

We can look at the data in tabular form in a Jupyter notebook using the Pandas package in Python.
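For instance, a first look at the data could be done roughly like this (a sketch that assumes you have downloaded a CSV copy of the trip data locally; the filename uber_data.csv is a placeholder, and for the newer months published as Parquet you would use pd.read_parquet instead):

import pandas as pd

# Load a local CSV copy of the NYC TLC trip data (filename is a placeholder)
df = pd.read_csv('uber_data.csv')

# Quick sanity checks before modelling: size, column types and a sample of rows
print(df.shape)
print(df.dtypes)
print(df.head())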

We can use Lucidchart, drag an entity relationship diagram onto our sheet, and start building the data model based on the data we have in the NYC taxi records.

In Lucidchart, search for ‘entity relationship’ and drag that onto the sheet to start building the data model

Given below is the final data model that shall be used for the project

Once the data model is ready, we can start with a few transformations in Python.

PYTHON TRANSFORMATIONS USING JUPYTER NOTEBOOKS:

Here we start our Python transformation by dropping all duplicates and then assigning a primary key — ‘trip_id’ — to the entire dataset.

#drop duplicates
df = df.drop_duplicates().reset_index(drop=True)
df['trip_id'] = df.index

Once the main dataframe is available, we create multiple smaller dataframes from it to build the dimension tables defined in the model above.
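As a rough illustration, one of the dimension tables could be carved out of the main dataframe like this (a minimal sketch with just one dimension; Darshil’s notebook linked below builds the full set of fact and dimension tables):

# Build a passenger_count dimension from the main dataframe (illustrative sketch)
passenger_count_dim = df[['passenger_count']].reset_index(drop=True)
passenger_count_dim['passenger_count_id'] = passenger_count_dim.index

# Keep the surrogate key first, as in the data model
passenger_count_dim = passenger_count_dim[['passenger_count_id', 'passenger_count']]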

Code mentioned by Darshil is available here — https://github.com/darshilparmar/uber-etl-pipeline-data-engineering-project/blob/main/Uber%20Data%20Pipeline%20(Fixed%20Version).ipynb

This code will be subsequently deployed in Mage via Google Compute Engine.

GCP: CREATING YOUR ACCOUNT

If you don’t have your own GCP account, you can create a free-tier account, which provides some basic services to help you get started on GCP. If you need help creating the free-tier account, I have written a blog about it: https://medium.com/@nishasreedharan/create-free-tier-gcp-account-1ec66fa2d536. If you still face issues, feel free to reach out to me and I shall definitely help out.

LOADING DATA INTO GOOGLE CLOUD STORAGE

If you have a basic idea about cloud services, you’ll know that for most cloud-based storage we start by creating buckets. Similarly, we have to create a bucket to get started on the project.

Type in ‘Google cloud storage’ and get started with creating buckets

When you create the bucket, ensure that it’s public so that we can access it via a URL. However, this is not the safest approach when working on a production-grade project.

Grant public access to the bucket

Once the bucket is made, upload the NYC data file into the bucket and be sure to change the permissions to make it public and accessible via URL. If you get stuck with any error, refer to the ‘Google Cloud and Mage Installation’ part of the video.

GCP COMPUTE ENGINE AND MAGE INSTALLATION

Since our project requires Mage to run, we shall install it on a VM that we build using GCP’s Compute Engine.

While creating the VM, ensure that HTTP traffic is allowed and choose a standard machine type with 4 vCPUs and 16 GB of memory. Once the instance is created, click on ‘SSH’ and this will open a terminal on your instance where you can start running commands. https://github.com/darshilparmar/uber-etl-pipeline-data-engineering-project/blob/main/commands.txt will help with all the installations required on the VM. You can also refer to Mage’s GitHub page for installing Mage; it has various options, including installing Mage via Docker.

RUN MAGE ON COMPUTE ENGINE

Once the installation is successful, run Mage using the command ‘mage start <project_name>’. If everything works, you will see Mage running on port 6789. To access Mage from the web UI, you need the external IP of your VM (which you can get by clicking on your VM in GCP). You can try typing in your external IP and the port to access Mage, e.g. 35.127.11.11:6789. But alas! You can’t access it. This is because we didn’t explicitly tell the VM to accept requests on port 6789. To remedy this, go to ‘Firewall’ in GCP and click on ‘Create firewall rule’.

GCP Firewall

In the new rule, fill in the details, set the source IP range to 0.0.0.0/0, specify the port (6789), and create the rule. This should open up the Mage UI.

CREATE FILES ON MAGE FOR ETL/TRANSFORMATION

Mage is a lot like other orchestration tools such as Airflow, with one big difference: Mage provides you with plenty of templates for loading, transforming and exporting data.

Data loader created in the Mage UI with API as the input

In the API section you can add the public URL at which the data is available in Google Cloud Storage. The code can be run using the run icon at the top of the data loader block. The output should be the NYC CSV data.
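For reference, the generated data loader block looks roughly like the sketch below. The exact template Mage generates can differ slightly between versions, and the bucket URL is a placeholder to be replaced with your own public object URL:

import io
import pandas as pd
import requests

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader
if 'test' not in globals():
    from mage_ai.data_preparation.decorators import test


@data_loader
def load_data_from_api(*args, **kwargs):
    # Public URL of the CSV uploaded to the Cloud Storage bucket (placeholder)
    url = 'https://storage.googleapis.com/<your-bucket>/uber_data.csv'
    response = requests.get(url)
    return pd.read_csv(io.StringIO(response.text), sep=',')


@test
def test_output(output, *args) -> None:
    assert output is not None, 'The output is undefined'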

In the next part we create a block for data transformation (using the generic template); in it we need to import pandas and write the code we used in the Jupyter notebook above. The data output by the data loader block goes into the data transformation block. The transformer converts the data into dataframes for the different fact and dimension tables mentioned above. The dataframes are then converted into dictionaries, which are passed on to the next block.
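A minimal sketch of such a transformer block is given here. Only one dimension table is shown for brevity, and the shape of the returned dictionary is an assumption that simply has to match what your exporter block expects:

if 'transformer' not in globals():
    from mage_ai.data_preparation.decorators import transformer


@transformer
def transform(df, *args, **kwargs):
    # Reuse the notebook logic: drop duplicates and assign the trip_id surrogate key
    df = df.drop_duplicates().reset_index(drop=True)
    df['trip_id'] = df.index

    # Build the dimension tables from the data model (only one shown here)
    passenger_count_dim = df[['passenger_count']].reset_index(drop=True)
    passenger_count_dim['passenger_count_id'] = passenger_count_dim.index

    # Return plain dictionaries so the exporter block can rebuild DataFrames
    return {
        'passenger_count_dim': passenger_count_dim.to_dict(orient='dict'),
        # ...the remaining dimension tables and the fact table go here...
    }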

The last part of the Mage pipeline is writing the output data into BigQuery. So we create a data exporter block that exports to BigQuery. This block holds the BigQuery table details, and an io_config.yaml file holds the GCP (BigQuery) credentials. To get the credentials, go to the GCP console, type in ‘APIs & Services’, go to ‘Create credentials’ and then ‘Create service account’.

By creating a service account we allow services on the VM to interact with services in GCP. Once you create the service account, go into it and add a key. Create a JSON key and download the file. In the JSON file you’ll find all the required credentials, which you can copy-paste into the YAML file in Mage. For the last block, the exporter, we’ll need a dataset in BigQuery to hold the tables.

For that we’ll go to BigQuery, click ‘Create dataset’, and create a dataset with multi-region enabled.

In the Mage exporter code, add the table details with the required dataset name and table name, e.g. fact_table. If you get a Google Cloud error when you run the exporter code, it could be due to missing Google Cloud packages on the VM. In that case, go back to the VM, open another SSH session by clicking on the SSH button, and install the Google Cloud packages using the commands shared in the link above. Refer to this link — https://github.com/darshilparmar/uber-etl-pipeline-data-engineering-project/tree/main/mage-files — to get the code for all the individual Mage files. Once the exporter block is run, it will load all the tables into BigQuery.
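For reference, a data exporter block based on Mage’s BigQuery template looks roughly like this. The project and dataset names are placeholders, and the loop assumes the transformer returned one dictionary per table, as sketched earlier:

from os import path

import pandas as pd

from mage_ai.data_preparation.repo_manager import get_repo_path
from mage_ai.io.bigquery import BigQuery
from mage_ai.io.config import ConfigFileLoader

if 'data_exporter' not in globals():
    from mage_ai.data_preparation.decorators import data_exporter


@data_exporter
def export_data_to_big_query(data, **kwargs) -> None:
    # io_config.yaml holds the service-account credentials pasted in earlier
    config_path = path.join(get_repo_path(), 'io_config.yaml')
    config_profile = 'default'

    # 'data' is the dictionary of table dictionaries returned by the transformer;
    # each key becomes one table in the BigQuery dataset (names are placeholders)
    for table_name, records in data.items():
        table_id = 'your-project.your_dataset.{}'.format(table_name)
        BigQuery.with_config(ConfigFileLoader(config_path, config_profile)).export(
            pd.DataFrame(records),
            table_id,
            if_exists='replace',  # overwrite the table if it already exists
        )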

BIG QUERY DATA ANALYSIS

In the BigQuery console you can see that all the dimension and fact tables have been loaded. Now you can start querying from the console.

BigQuery console for data analysis

For analysing the data and creating a dashboard, we can create a separate table based on this query — https://github.com/darshilparmar/uber-etl-pipeline-data-engineering-project/blob/main/analytics_query.sql. This query joins all the required tables and extracts only the relevant metrics needed for the dashboard.

BUILDING DASHBOARD

Now that we have a table specifically for analytics, we can start building our first dashboard on Looker Studio. Go to Google, type ‘Google Looker Studio’, and click on the first link. Once the interface opens, click on ‘Blank Report’ and, when you get the option to select the source, select ‘BigQuery’ as shown below.

Select BigQuery as source in Looker Studio

Once you add BigQuery as the source, you might have to authorise it the first time. After that, choose the dataset and table that we need to load into the dashboard and it will appear as a table on the report. You can remove or keep the table and design the dashboard as you like. Since it was my first dashboard, I used the one Darshil built to get started. You can let your creativity flow with this.

This is the final version of the dashboard built by Darshil:

Uber Dashboard on Looker

After completing this project, my next step would be to recreate it for a different use case, since it really helped me learn a lot about various technologies and about solving the errors I hit along the way.

Hope you enjoy this slice of heaven called data engineering! Big thanks to Darshil for helping me get started on the journey of building projects by myself.
