Data Engineering Zoomcamp Series — My Notes

Wambui Gitau
7 min read · Feb 7, 2022


After working in the data science profession for a while, I decided to explore data engineering. Having worked in the African market, I have seen that a lot of data engineering work needs to be done before we can move on to data analysis and business intelligence.

I found an amazing course: the Data Engineering Zoomcamp. The best part about it is that it is very practical. In this series, I will be documenting my notes, organized into weeks just like the course. I will add extra resources for areas that are new to me. These notes are a complement to the course videos, which I would advise you to watch.

Setup

  • Ubuntu 18.04 WSL2
  • Google Cloud (Will add Azure content later)
  • Terraform
  • Docker
  • Postgresql

Week 1 — Setting up our environment

WSL 2

The first thing to do is make sure that your WSL version is 2. When installing Docker I realized that I was working with WSL 1. If you have been working on Windows, I would advise moving to WSL 2; everything will be smoother. Or maybe it's just me enjoying the best of both worlds: Office on Windows and a dev environment on Linux :).

To set up WSL 2, use this link for a fresh install: https://docs.microsoft.com/en-us/windows/wsl/install.

To check your WSL version, run wsl -l -v; this will show the version of WSL running, as shown below. If it is 1, you can set 2 as the default using wsl --set-version <distro name> 2, where <distro name> is the OS distribution name. Mine is Ubuntu 18.04.

wsl versions running

You will also need to have the sudo password. If you have not used it in a while and have forgotten it like me, take the following steps:

  • Set root as the default user: ubuntu1804 config --default-user root. Replace ubuntu1804 with your distribution name.
  • Reset the password for the account: passwd username
  • Set your normal/regular user back as the default: ubuntu1804 config --default-user username

To set up your development environment if you have not already, use this link: https://docs.microsoft.com/en-us/windows/wsl/setup/environment. It walks you through configuring VS Code to connect to WSL. Once WSL 2 is ready and you can connect to it in VS Code, the next step is to install Docker.

Docker

Docker is an open platform that allows us to separate our applications from our infrastructure [1]. This makes shipping applications fast: you create a simulated production environment and develop your application there, so when you move to production, infrastructure compatibility will not be a problem. This is just one of the advantages of working with Docker. It has one con, though: it is a very heavy application.

You will install Docker on Windows and then connect it to WSL 2.

  • Download the installer from here.
  • Double click on the .exe and follow the instructions.
  • Start Docker Desktop, then go to Settings -> General. "Use the WSL 2 based engine" should be checked.
Settings -> General
  • If Use the WSL 2 based engine is greyed, go to Resources -> wsl integration and select your default distro.
wsl integration
  • Open VS Code and connect to WSL using the green button in the bottom-left corner. Select Open Folder in WSL if you have your project folder ready. Note: the name of the folder should be in lowercase.
  • To test that it works: docker -v. Note: when running commands, Docker had a network issue for me. I fixed it by configuring DNS manually in Docker Desktop, under Settings -> Resources -> Network.
Docker installed

To make sure that you don't have to use sudo to run Docker, follow the instructions below. Do not run these from the VS Code terminal.

sudo groupadd docker
sudo gpasswd -a $USER docker
# Log out and log back in so that your group membership is re-evaluated.
sudo service docker restart

To run any container, you need an image as a starting point. There are a lot of images on Docker Hub. To build an application, you will need a Dockerfile, a text-based script with instructions for how to create the container image.

FROM python:3.9
RUN apt-get install wget
RUN pip install pandas sqlalchemy psycopg2
WORKDIR /app
COPY ingest_data.py ingest_data.py
ENTRYPOINT [ "python", "ingest_data.py" ]
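The ingest_data.py that this Dockerfile copies in comes from the course. As a rough sketch of what such a script looks like (the argument names here are my assumptions, not the course's exact script), it parses connection details and loads a CSV into Postgres:

```python
import argparse

def parse_args(argv=None):
    """Parse the CLI arguments the container's entrypoint receives."""
    parser = argparse.ArgumentParser(description="Ingest CSV data into Postgres")
    parser.add_argument("--user", required=True)
    parser.add_argument("--password", required=True)
    parser.add_argument("--host", default="localhost")
    parser.add_argument("--port", type=int, default=5432)
    parser.add_argument("--db", required=True)
    parser.add_argument("--table", required=True)
    parser.add_argument("--url", required=True, help="URL of the CSV to download")
    return parser.parse_args(argv)

def ingest(args):
    # Hypothetical body, using the pandas/SQLAlchemy/psycopg2 stack the
    # Dockerfile installs (shown as comments, not executed here):
    #   engine = create_engine(f"postgresql://{args.user}:{args.password}"
    #                          f"@{args.host}:{args.port}/{args.db}")
    #   for chunk in pd.read_csv(args.url, chunksize=100_000):
    #       chunk.to_sql(args.table, engine, if_exists="append")
    pass
```

These arguments are what you pass after the image name in docker run, since the ENTRYPOINT forwards them to the script.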

More information about the Dockerfile and its contents can be found here.

Useful Docker commands:

  • Build an image: docker build -t <imagename:tag> .
  • Run it: docker run -it <imagename:tag>
  • Pass arguments to the entrypoint: docker run -it <imagename:tag> <arg1> <arg2>
  • Add environment variables: docker run -it -e <var_name>="<value>" <imagename:tag>
  • Mount a folder so that you don't lose data between runs: docker run -it -v <path_to_local_folder>:<path_to_folder_in_docker> <imagename:tag>

Docker with Postgres
docker run -it \
-e POSTGRES_USER="root" \
-e POSTGRES_PASSWORD="root" \
-e POSTGRES_DB="ny_taxi" \
-v $(pwd)/ny_taxi_postgres_data:/var/lib/postgresql/data \
-p 5432:5432 \
postgres:13

When mounting a volume, make sure you provide the Linux home path. A Windows-mounted path (/mnt/c/Users) will not work.

PGCLI

This is the PostgreSQL command-line client. During this week, installing it was the most troublesome task. After multiple tries, the steps below worked:

sudo apt-get install libpq-dev python-dev
pip install pgcli

If it is not installed properly, you will have issues with SQLAlchemy because of the psycopg2 package. To connect to the Postgres database using pgcli, use the command below.

pgcli -h localhost -p 5432 -u root -d <database name>

Use the host port you published with -p: the docker run command above maps 5432, while the docker-compose file further down maps 5431.
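The same connection details can be reused from Python. A minimal, dependency-free sketch (the helper function name is mine) that builds the postgresql:// URL SQLAlchemy and pgcli both understand:

```python
def postgres_url(user, password, host, port, database):
    """Build a postgresql:// connection URL from its parts."""
    return f"postgresql://{user}:{password}@{host}:{port}/{database}"

# Matches the credentials used when starting the postgres:13 container:
print(postgres_url("root", "root", "localhost", 5432, "ny_taxi"))
# → postgresql://root:root@localhost:5432/ny_taxi
```

You would pass this URL to sqlalchemy.create_engine in the ingestion script.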

pgAdmin

pgAdmin is a web-based GUI used to interact with Postgres database sessions. For this course we will start pgAdmin in a separate container.

docker run -it \
-e PGADMIN_DEFAULT_EMAIL='admin@admin.com' \
-e PGADMIN_DEFAULT_PASSWORD='root' \
-p 8080:80 \
dpage/pgadmin4

Access the GUI using http://localhost:8080 in the browser.

For the pgAdmin container to connect to the Postgres container, they both need to be on the same network, so you need to create a Docker network:

docker network create <networkname>

The commands to run both containers change: the network, and a container name that will act as the hostname, are added.

docker run -it \
-e POSTGRES_USER="root" \
-e POSTGRES_PASSWORD="root" \
-e POSTGRES_DB="ny_taxi" \
-v $(pwd)/ny_taxi_postgres_data:/var/lib/postgresql/data \
-p 5432:5432 \
--network=pg-network \
--name=pg-database3 \
postgres:13
docker run -it \
-e PGADMIN_DEFAULT_EMAIL='admin@admin.com' \
-e PGADMIN_DEFAULT_PASSWORD='root' \
-p 8080:80 \
--network=pg-network \
--name=pgadmin4 \
dpage/pgadmin4

Note: every time you run the commands above, you will need to change the --name or remove the old container first (docker rm <name>).

GCP

  • Create an account and activate the credits
  • Create a new project
create GCP project
  • Create a service account. This will be the account used by services such as APIs, pipelines, etc.
  • Create keys: once you have created the service account, create a key for it to use. Under Manage keys, add a key and select JSON. The file will be downloaded.

If it is a new installation, run gcloud init before following the instructions below.

Set environment variable to point to your downloaded GCP keys:

export GOOGLE_APPLICATION_CREDENTIALS="<path/to/your/service-account-authkeys>.json"

# Refresh token/session, and verify authentication
gcloud auth application-default login
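A small sanity check can save you a round of confusing auth errors. This helper is my own (not part of gcloud); it verifies that the file the environment variable points at actually looks like a service-account key:

```python
import json
import os

def looks_like_service_account_key(path):
    """Return True if the JSON file has the fields a service-account key carries."""
    with open(path) as f:
        key = json.load(f)
    # Service-account keys always have type "service_account" and a private key.
    return key.get("type") == "service_account" and "private_key" in key

def check_env():
    """Check that GOOGLE_APPLICATION_CREDENTIALS is set and points at a key file."""
    path = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    return path is not None and os.path.exists(path) and looks_like_service_account_key(path)
```

If check_env() returns False, re-export the variable with the correct path before running gcloud auth application-default login.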

Terraform

Terraform is an open-source tool used to provision infrastructure resources; this approach is referred to as Infrastructure as Code. Some advantages include:

  • Infrastructure lifecycle management
  • Version control commits
  • Very useful for stack-based deployments, with cloud providers such as AWS, GCP, and Azure, and with Kubernetes
  • State-based approach to track resource changes throughout deployments

I have not tried changing state of my infrastructure to see how this will work. We also did not write complex code. Still need to go deeper.

Use this link to download the latest version: https://www.terraform.io/downloads. As with gcloud, select the Linux version, since you will install it in WSL 2.

Terraform has 4 main commands:

terraform init

# Check changes to new infra plan
terraform plan -var="project=<your-gcp-project-id>"
# Create new infra
terraform apply -var="project=<your-gcp-project-id>"
# Delete infra after your work, to avoid costs on any running services
terraform destroy
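These commands act on the .tf files in the current directory. As a minimal sketch of what such a file might contain (this is my own illustration of a single Cloud Storage bucket, not the course's exact configuration; resource and variable names are assumptions):

```hcl
terraform {
  required_providers {
    google = {
      source = "hashicorp/google"
    }
  }
}

variable "project" {
  description = "Your GCP project ID"
}

provider "google" {
  project = var.project
  region  = "us-central1"
}

# Bucket names must be globally unique, hence the project-ID prefix.
resource "google_storage_bucket" "data_lake" {
  name     = "${var.project}-data-lake"
  location = "US"
}
```

terraform plan then shows this bucket as a resource to be created, and terraform apply creates it.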

Docker-Compose

Docker Compose is used to run multiple containers with one command. Instead of opening multiple tabs and starting each container separately, Docker Compose handles them together.

This will be installed in WSL 2 just like the others. You can find the latest version here, then download it using the commands below.

sudo curl -L "https://github.com/docker/compose/releases/download/v2.2.3/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose

To test:
docker-compose --version

You will create a docker-compose.yaml file that will contain the commands that you were running previously.

services:
  pgdatabase:
    image: postgres:13
    environment:
      - POSTGRES_USER=root
      - POSTGRES_PASSWORD=root
      - POSTGRES_DB=ny_taxi
    volumes:
      - "./ny_taxi_postgres_data:/var/lib/postgresql/data:rw"
    ports:
      - "5431:5432"
  pgadmin:
    image: dpage/pgadmin4
    environment:
      - PGADMIN_DEFAULT_EMAIL=admin@admin.com
      - PGADMIN_DEFAULT_PASSWORD=root
    ports:
      - "8080:80"

Run docker-compose up -d to start the services, and docker-compose down to stop them.

For more information, check https://docs.docker.com/compose/reference/


Wambui Gitau

Data scientist transitioning to Data engineer. Will mostly be talking about Cloud Data Engineering concepts