Stories by jmregs on Medium

Using Snowflake and Data Build Tool (DBT)

jmregs — Tue, 27 Aug 2024 01:14:06 GMT

On my career journey as a data analyst, I have heard of and used a lot of popular databases and data warehouses, but one popular tool I had not used yet is Snowflake. For those who are not familiar with it, Snowflake is a cloud-native data warehouse designed for analytical (OLAP) workloads. For this article, I tested out Snowflake by loading a sample online retail dataset that I stored in a Google Drive folder into Snowflake using FiveTran, and then doing a simple transformation on the data using Data Build Tool (DBT). The diagram for this can be seen below.

Data Pipeline Diagram

What is Snowflake?

Snowflake is a cloud-native, fully managed data warehouse platform designed for large-scale data and complex queries. It differs from most data warehouses in that it is designed entirely for the cloud; it does not have an on-premises version. Another distinctive thing about Snowflake is that it separates compute and storage resources, so each can be scaled independently, either by the user or automatically by Snowflake depending on the load. Because compute only has to run when it is needed, this can lead to significant cost savings. Lastly, since it is fully managed, all infrastructure management is handled by Snowflake, allowing users to focus their time on data analysis.

Testing Out Snowflake

The first thing I did was get sample data to load into Snowflake. For that, I used a sample online retail dataset from the UK, which I got from Kaggle: https://www.kaggle.com/datasets/ulrikthygepedersen/online-retail-dataset. I stored the data in a folder on my Google Drive. This can be done with any cloud storage service like Amazon S3 or Microsoft Azure Blob Storage; I used Google Drive just to be able to do it quickly.

Loading Data to Snowflake using FiveTran

Once we have the data in our chosen cloud storage, we first need to set up the Google Drive to FiveTran connection and the FiveTran to Snowflake connection to be able to load the data into Snowflake. The Google Drive to FiveTran connection just needs the link to the Google Drive folder and access to it. Once that is done, the setup would look like the screenshot below.

FiveTran Setup

The contents of the Google Drive folder can also be seen on the FiveTran user interface as seen below.

Google Drive Folder

As for the FiveTran to Snowflake connection, we first need to set up the user, role, database, and warehouse in Snowflake. We will use these when setting up the connection. The good thing about Snowflake is that this setup can be done directly with SQL.

The script for this can be seen below.

begin;

   -- create variables for user / password / role / warehouse / database (needs to be uppercase for objects)
   set role_name = 'FIVETRAN_ROLE';
   set user_name = 'FIVETRAN_USER';
   set user_password = 'password12345';
   set warehouse_name = 'FIVETRAN_WAREHOUSE';
   set database_name = 'SAMPLE_DATA';

   -- change role to securityadmin for user / role steps
   use role securityadmin;

   -- create role for fivetran
   create role if not exists identifier($role_name);
   grant role identifier($role_name) to role SYSADMIN;

   -- create a user for fivetran
   create user if not exists identifier($user_name)
   password = $user_password
   default_role = $role_name
   default_warehouse = $warehouse_name;

   grant role identifier($role_name) to user identifier($user_name);

   -- set binary_input_format to BASE64
   ALTER USER identifier($user_name) SET BINARY_INPUT_FORMAT = 'BASE64';

   -- change role to sysadmin for warehouse / database steps
   use role sysadmin;

   -- create a warehouse for fivetran
   create warehouse if not exists identifier($warehouse_name)
   warehouse_size = xsmall
   warehouse_type = standard
   auto_suspend = 60
   auto_resume = true
   initially_suspended = true;

   -- create database for fivetran
   create database if not exists identifier($database_name);

   -- grant fivetran role access to warehouse
   grant USAGE
   on warehouse identifier($warehouse_name)
   to role identifier($role_name);

   -- grant fivetran access to database
   grant CREATE SCHEMA, MONITOR, USAGE
   on database identifier($database_name)
   to role identifier($role_name);
   
 commit;

The script creates a user and assigns it a role, and that role has access to the database where the data will be stored and to the warehouse used to load it. In Snowflake’s case, a warehouse is the compute resource used to run queries and other compute work, not the place where the data lives. The user, role, warehouse, and database names can be changed to anything you like, and the sample password should be replaced with a strong one. The script can be run directly in Snowflake, as seen below.

SQL Script in Snowflake

Once everything has been set up on the Snowflake side, we just need to enter the created credentials in the FiveTran to Snowflake setup. Once that is done, the finished setup would look like the screenshot below.

FiveTran to Snowflake Setup

Once we have everything set up, we can now load the contents of the Google Drive folder into Snowflake using FiveTran. We can do a manual run by clicking the button below. FiveTran can also be scheduled to load the data every day, week, or month, if you want.

FiveTran Manual Run

Using DBT to Perform Simple Transformation

After loading the data from Google Drive to Snowflake using Fivetran, we should now be able to see the data in Snowflake. To check, we can run a simple script to check the table, as seen below.

Checking Table on Snowflake

We can see on the screenshot above that the data was loaded to Snowflake.

Once we have the raw data loaded, we can use DBT to perform a simple transformation on the data. To create the connection between Snowflake and DBT, we first need to set up a few more objects in Snowflake.

The script for the setup can be seen below.

-- Database that will hold the transformed data
CREATE DATABASE Analytics;

-- Warehouse (compute resource) used by DBT
CREATE WAREHOUSE transforming WITH warehouse_size = 'MEDIUM';

-- Role shared by the DBT users
CREATE ROLE transformer;

-- Read access to the raw data loaded by FiveTran
GRANT USAGE ON DATABASE SAMPLE_DATA TO ROLE transformer;
GRANT USAGE ON SCHEMA SAMPLE_DATA.GOOGLE_DRIVE TO ROLE transformer;
GRANT SELECT ON ALL TABLES IN SCHEMA SAMPLE_DATA.GOOGLE_DRIVE TO ROLE transformer;

-- Access to the database where DBT writes its models
GRANT USAGE ON DATABASE analytics TO ROLE transformer;
--grant reference_usage on database analytics to role transformer;
GRANT MODIFY ON DATABASE analytics TO ROLE transformer;
GRANT MONITOR ON DATABASE analytics TO ROLE transformer;
GRANT CREATE SCHEMA ON DATABASE analytics TO ROLE transformer;

-- Access to the warehouse
GRANT OPERATE ON WAREHOUSE transforming TO ROLE transformer;
GRANT USAGE ON WAREHOUSE transforming TO ROLE transformer;

-- Create user for development environment (sample password, replace it)
CREATE USER dbt_dev
email = 'dbt_dev@gmail.com'
password = 'Sample123'
default_role = transformer
default_warehouse = transforming
must_change_password = true;

GRANT ROLE transformer TO USER dbt_dev;

-- Create user for deployment environment (sample password, replace it)
CREATE USER dbt_prod
email = 'dbt_prod@gmail.com'
password = 'Sample123'
default_role = transformer
default_warehouse = transforming
must_change_password = true;

GRANT ROLE transformer TO USER dbt_prod;

The script above creates a database, a warehouse, and a role that we will use when connecting DBT to Snowflake. The transformer role gives any user who holds it read access to the “SAMPLE_DATA” database, which contains the raw data, and lets them create schemas in the “Analytics” database, which is where the transformed data from DBT will be placed. The script also creates two users, “dbt_dev” and “dbt_prod”, that will be used in our development and deployment environments, respectively. Both users have been granted the transformer role.
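
The read-access part of the script follows a pattern worth spelling out: a role needs USAGE on the database, USAGE on the schema, and SELECT on the tables before it can read anything. The Python helper below is a small sketch of that pattern. Note that `read_only_grants` is a hypothetical name of my own, not part of Snowflake or DBT, and it only builds the SQL text without connecting to anything.

```python
def read_only_grants(role, database, schema, include_future=False):
    """Build the GRANT statements a role needs to read one schema.

    Hypothetical helper for illustration: it only returns SQL strings.
    """
    qualified_schema = f"{database}.{schema}"
    statements = [
        f"GRANT USAGE ON DATABASE {database} TO ROLE {role};",
        f"GRANT USAGE ON SCHEMA {qualified_schema} TO ROLE {role};",
        f"GRANT SELECT ON ALL TABLES IN SCHEMA {qualified_schema} TO ROLE {role};",
    ]
    if include_future:
        # "ALL TABLES" only covers tables that exist when the grant runs;
        # a FUTURE grant also covers tables created later.
        statements.append(
            f"GRANT SELECT ON FUTURE TABLES IN SCHEMA {qualified_schema} TO ROLE {role};"
        )
    return statements


if __name__ == "__main__":
    for statement in read_only_grants("transformer", "SAMPLE_DATA", "GOOGLE_DRIVE"):
        print(statement)
```

The optional `include_future` flag is there because a grant on all tables only applies to the tables that exist at the moment it runs, which may matter if the loader adds new tables to the schema later.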

Once we have everything set up in Snowflake, we just need to connect DBT to Snowflake. The first step is to create a connection to Snowflake, which we will use in our environments. The setup for this can be seen below.

Snowflake Connection Setup in DBT

Once our connection is ready, we now need to create two environments for the DBT project, as seen below. The “Development” environment is used when developing or testing out the DBT models, while the “prod_environment” is used when we have jobs running the created DBT models.

DBT Environments

For the “Development” environment, the setup for this can be seen below.

Development Environment Setup

It uses the Snowflake connection from earlier, and for the development credentials it uses the dbt_dev user we created earlier. The dbt_dev schema is where the tables will be stored when we test the models during development.

As for the “Production” environment, the setup for this can be seen below.

Production Environment Setup

The only difference with the “Development” environment is that we are using a different user and schema. Everything else is the same as the “Development” environment.

Once we have everything set up, we can now create the DBT models to transform our data in the DBT Cloud IDE. Since I just wanted to test out what working with Snowflake is like, I created just one simple DBT model, as seen below.

/*
    CREATED BY: dbt_dev
    CREATED ON: August 15, 2024
    DESCRIPTION: Gets the top 500 rows based on the invoice date
*/

{{ 
    config(
        materialized='table',
        alias='top_500_online_retail'
        )
}}


SELECT *,
CURRENT_TIMESTAMP() AS inserttimestamp
FROM SAMPLE_DATA.google_drive.online_retail
ORDER BY INVOICE_DATE DESC
LIMIT 500

The model simply gets the 500 most recent rows of the online retail data, based on the invoice date. It also adds an “inserttimestamp” column that records when the model was run and the rows were written. The “config” part at the start tells DBT that it should create a table for the model, and that the name of the table should be “top_500_online_retail”.
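
To make the query logic concrete without a Snowflake account, here is a minimal sketch that runs the same ORDER BY ... LIMIT pattern against an in-memory SQLite table. The rows are made up, the limit is 2 instead of 500, and SQLite's CURRENT_TIMESTAMP stands in for Snowflake's CURRENT_TIMESTAMP(). It only illustrates the shape of the result, not how DBT executes the model.

```python
import sqlite3

# Made-up rows standing in for the online retail table.
rows = [
    ("INV-1", "2010-12-01 08:26:00"),
    ("INV-2", "2010-12-01 08:28:00"),
    ("INV-3", "2011-12-09 12:50:00"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE online_retail (invoice_no TEXT, invoice_date TEXT)")
conn.executemany("INSERT INTO online_retail VALUES (?, ?)", rows)

# Same pattern as the model: newest rows first, keep the top N,
# and stamp every row with the time the query ran.
top_rows = conn.execute(
    """
    SELECT *, CURRENT_TIMESTAMP AS inserttimestamp
    FROM online_retail
    ORDER BY invoice_date DESC
    LIMIT 2
    """
).fetchall()
conn.close()

for row in top_rows:
    print(row)
```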

The script in the DBT cloud IDE would look like the screenshot below.

DBT Model Script

To run the model, just run the command below in the DBT Cloud console. The model file is named “prod_dbt_model.sql”.

dbt build --select prod_dbt_model.sql

After the model runs, the transformed data should be loaded into the new table. This can be seen in the Snowflake screenshot below.

DBT Model Table

Setting up a job in DBT to load data to Snowflake

Now that our model is good to go, we can set up the job to load the transformed data into Snowflake and schedule it to run regularly. For this, I created a job and used the “prod_environment” we set up earlier. The setup for the job can be seen below.

DBT Job Setup

The command for the job can be seen below. Every time the job runs, it will build this model. The job is scheduled to run every 12 hours, but it can be scheduled to run at any interval.

dbt build --select prod_dbt_model.sql

Once the setup for the job is done, the job can be run manually to test it out, as seen below. Just click the “Run now” button, and it should run the job.

DBT Job Manual Run

After running the job, DBT should load the transformed data to the schema we specified in the “prod_environment”, which is DBT_PROD. We can check if the data was loaded by querying the table in Snowflake. After querying the table, we can see that the table exists, and it has the top 500 rows based on invoice date, as seen below.

DBT Prod Table

Conclusion

Working with Snowflake has been a breeze. It connected to two other popular data tools, FiveTran and DBT, with no issues. As for the Snowflake platform itself, the UI is very easy to navigate, and everything is straightforward. Users can easily set up a database or warehouse, or query whatever data they want. Even in the short time I had to work with Snowflake, I loved the experience of using it for a simple project. Given that Snowflake is fully managed, I can see why it became so popular in such a short time. It really allows users to focus on the data, and not on the other work associated with running a database or data warehouse, such as infrastructure management, maintenance, and version upgrades.

Using Airflow with Docker

jmregs — Wed, 19 Jun 2024 05:13:49 GMT

I was looking at a data pipeline I created from my previous article (https://medium.com/@josemarireguyal/creating-an-end-to-end-data-pipeline-with-dbt-data-build-tool-and-coinmarketcap-data-9d616b827597), and I wanted to make an improvement to it, by adding an orchestrator to the mix.

You might be thinking, what is an orchestrator? An orchestrator is basically a task scheduler for your scripts. It allows the user to manage the dependencies between tasks to ensure they go in the right order or sequence, and automates the execution of these tasks based on predefined schedules (e.g. daily, hourly, weekly, etc.). In addition to that, orchestrators allow the users to see the log file for each task to study any errors that might occur during a run.

There are a lot of orchestrators to choose from nowadays, but I decided to go with Airflow.

What is Airflow?

I chose Airflow because it is one of the most popular orchestrators out there. It allows users to create workflows using Python code, which allows for highly customized use cases. It is also an open-source project with a strong community, which makes it easy to find documentation for the tool. Because of these two strengths, Airflow is used by everyone from big companies like Airbnb, Square, and Robinhood to people running personal projects.

How does it actually work though?

There are 3 main concepts when it comes to Airflow. These are:

Directed Acyclic Graphs (DAGs) = A DAG is a collection of tasks with defined dependencies and relationships. It ensures that tasks are executed in a specific order without any cycles. Each DAG is defined in a Python script.
Tasks = These represent a unit of work in a DAG. They are instances of ‘Operators’.
Operators = These are the building blocks of a task. They define the action to be performed in a task.
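
As a rough illustration of the "specific order without any cycles" idea, the sketch below uses Python's standard-library graphlib module rather than Airflow itself. The task names are hypothetical, and this is not how Airflow schedules work internally. It only shows how dependencies determine a valid order and why a cycle makes ordering impossible.

```python
from graphlib import CycleError, TopologicalSorter

# Each key is a task; its value is the set of tasks that must finish first.
# Task names are hypothetical placeholders.
dependencies = {
    "fetch": set(),
    "load": {"fetch"},
    "report": {"load"},
}

order = list(TopologicalSorter(dependencies).static_order())
print(order)

# Making "fetch" depend on "report" creates a cycle,
# so no valid execution order exists anymore.
cyclic = dict(dependencies, fetch={"report"})
try:
    list(TopologicalSorter(cyclic).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)
```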

I will go into more depth later regarding these concepts when I show the actual code for our example.

Before we start using Airflow, there is actually one problem if you are using Windows. Airflow is Unix-based, which means a lot of the processes used in Airflow are not compatible with Windows-based systems. Since I am using Windows, I need a workaround for this, and that workaround is Docker.

What is Docker?

Docker allows users to package software into standard units called “containers” that have everything the software needs to run. This includes libraries, system tools, code, and runtime.

Docker allows the user to run each container isolated from each other. This sounds a lot like virtual machines (VMs), but Docker is actually more lightweight to use compared to VMs.

The reason for this is that containers in Docker run on the Docker engine, which uses the host OS. This means that even though all the containers are isolated from each other, they all share the kernel of the host OS. Compare this to how VMs work, wherein each VM has its own guest OS, and the hypervisor manages the hardware resources for each VM. This allows for huge customizability and flexibility for VMs, but in many cases, users do not need this sort of customizability.

This is why Docker is so popular, because it is so lightweight compared to regular VMs.

Why does Airflow run on Docker on Windows?

You may be wondering why Airflow will run on Docker, if Docker containers use the host OS for the containers. That is a valid point, but the Docker engine can actually take advantage of the Windows Subsystem for Linux (WSL2) to create a Linux environment for containers that need a Linux-based environment to run.

Airflow Process

So even though the host OS is Windows, the Docker engine can use WSL2 to create a Linux environment for our containers, which will need a Linux-based environment.

Now that Airflow and Docker have been explained, we can look at an example of these two tools working together.

Code

There are multiple parts to getting Airflow running in Docker. I will divide these into sections.

Docker Compose File

To start things off, we first need to create a Docker Compose file, which is a YAML configuration file used to define and manage Docker applications that use multiple containers. The code for this is usually stored in a file named “docker-compose.yml”. This is needed, because we will need multiple services to get Airflow running in Docker.

For this demonstration, this is what the Docker Compose file looks like.

version: '3.8'
services:
  postgres:
    image: postgres:13
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    volumes:
      - postgres_data:/var/lib/postgresql/data

  init-db:
    image: apache/airflow:2.4.1
    depends_on:
      - postgres
    environment:
      AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
    command: airflow db init
    volumes:
      - ./dags:/opt/airflow/dags
      - ./logs:/opt/airflow/logs
      - ./plugins:/opt/airflow/plugins

  webserver:
    image: apache/airflow:2.4.1
    depends_on:
      - postgres
      - init-db
    environment:
      AIRFLOW__CORE__EXECUTOR: LocalExecutor
      AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
      AIRFLOW__WEBSERVER__SECRET_KEY: mysecretkey
      AIRFLOW__WEBSERVER__AUTHENTICATE: "False"  # Airflow 1.x option; likely ignored by the 2.x image
      AIRFLOW__WEBSERVER__RBAC: "False"  # Airflow 1.x option; likely ignored by the 2.x image
      FLASK_APP: "airflow.www.app"
    volumes:
      - ./dags:/opt/airflow/dags
      - ./logs:/opt/airflow/logs
      - ./plugins:/opt/airflow/plugins
    ports:
      - "8080:8080"
    command: >
      bash -c "export FLASK_APP=airflow.www.app && 
           airflow webserver & 
           sleep 10 && 
           airflow users create -r Admin -u admin -e admin@example.com -f Admin -l User -p mypassword &&
           tail -f /dev/null"

  scheduler:
    image: apache/airflow:2.4.1
    depends_on:
      - postgres
      - init-db
    environment:
      AIRFLOW__CORE__EXECUTOR: LocalExecutor
      AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
    volumes:
      - ./dags:/opt/airflow/dags
      - ./logs:/opt/airflow/logs
      - ./plugins:/opt/airflow/plugins
    command: scheduler

volumes:
  postgres_data:

I will first explain the general parts of the code, and further explain later the use of each service:

version = this refers to the version of the Docker Compose file format. Older Compose releases used it to decide which features were available, while recent Docker Compose releases treat the field as obsolete and ignore it.
services = these define the blueprint for the containers that will run on Docker. It defines how the containers will behave in production, how many containers will run, and how these containers will interact with one another. In our case, there are 4 services: postgres, init-db, webserver, and scheduler. I will explain the use of each service after explaining the general parts of the code.
image = this refers to pre-built Docker images used to create the containers for the services. It is a snapshot of an application and its dependencies, which means that it can easily be replicated and deployed across different environments
depends_on = used to control the startup order, so that the listed services are started before this service. In the simple list form used here, Compose only waits for those containers to start, not for the applications inside them to be ready. This is used to make sure dependencies are followed in multi-container Docker applications.
environment = sets the environment variables for each service.
volumes = these are used to keep data persistent even if the containers are stopped or removed. There are two types of volumes in a Docker Compose file: Named Volumes and Bind Mounts. Named Volumes are created by the user and managed by Docker. These are used to persist data and to be able to share it across multiple services. Bind Mounts map a directory from the host machine to a directory in the container. This is useful since changes in the host directory are reflected in the container directory, and vice versa.
command = allows users to specify what command the container should run when it starts up.
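
One practical consequence is worth a sketch. When depends_on is written as a plain list, a container can start before the database in another container is ready to accept connections. A common workaround is to poll the database port until it opens. The helper below shows that idea with the standard library only; `wait_for_port` is a hypothetical name, and the demo uses a throwaway local listener instead of a real PostgreSQL server.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, else False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False


if __name__ == "__main__":
    # Demo against a throwaway local listener so the sketch runs anywhere.
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))  # Port 0 lets the OS pick a free port
    listener.listen(1)
    print(wait_for_port("127.0.0.1", listener.getsockname()[1], timeout=2.0))
    listener.close()
```

Inside the Compose network, the equivalent call would presumably be `wait_for_port("postgres", 5432)`, since services can reach each other by service name. An open port does not strictly prove that the database has finished initializing, so this is a rough check rather than a guarantee.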

After explaining the general parts, let us now go into the use and specifics of each service:

postgres = This service sets up the PostgreSQL database that will be used to store Airflow’s metadata
init-db = This service sets up the tables and schema that Airflow needs to operate. These are stored in the PostgreSQL database created by the first service. The environment variable of this service is the connection string used to connect to that database.
webserver = This service provides the web-based user interface used to manage and monitor Airflow. It is reached on port 8080 of the host, which is mapped to port 8080 in the container (the 8080:8080 entry in the code).
scheduler = This service schedules and triggers the tasks based on the DAG definitions. It operates in the background, whenever a DAG is asked to run.

DAG File

Before running Docker, we first need to create the DAG Python script file. This Python script contains one or more DAGs using the Airflow framework, which the scheduler will use to orchestrate the tasks. The metadata of the DAG Python script will be stored in the PostgreSQL database container that was created earlier.

I will also explain each part of the code below.

# Libraries
from airflow import DAG
from airflow.operators.python import PythonOperator  # Airflow 2.x import path
from datetime import datetime
from scripts.fetch_data import fetch_data
from scripts.load_data import load_data_postgres_docker

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2024, 5, 1),
}

with DAG('fetch_and_load_data_dag', default_args=default_args, schedule_interval='@daily') as dag:
    
    fetch_data_task = PythonOperator(
        task_id='fetch_data',
        python_callable=fetch_data
    )

    load_data_task = PythonOperator(
        task_id='load_data_postgres_docker',
        python_callable=load_data_postgres_docker,
    )

    fetch_data_task >> load_data_task

Libraries = These are the needed libraries to run Airflow. Take note that fetch_data and load_data_postgres_docker are Python scripts that I created that will be orchestrated in Airflow.
default_args = These are the default arguments that are applied to every task in the DAG.
fetch_data_task and load_data_task = These are the two tasks within the DAG.
PythonOperator = This operator allows you to run a Python function as a task in a DAG.
task_id = The task name that will show up in the Airflow webserver for each task.
python_callable = These are the functions from the two Python scripts I created that will get data from an API, and load it into the PostgreSQL database that was created.
fetch_data_task >> load_data_task = sets up the dependency between the two tasks. It indicates that fetch_data_task should go first before load_data_task.
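
The `>>` syntax is ordinary Python operator overloading. The toy class below is a sketch of how such a syntax can be built with `__rshift__`. `ToyTask` is hypothetical and far simpler than Airflow's real operator classes, so treat it as an illustration of the idea only.

```python
class ToyTask:
    """Hypothetical stand-in for an Airflow task, for illustration only."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []  # Tasks that should run after this one

    def __rshift__(self, other):
        # "a >> b" records that b runs after a, then returns b
        # so that chains like a >> b >> c keep working.
        self.downstream.append(other)
        return other


fetch = ToyTask("fetch_data")
load = ToyTask("load_data_postgres_docker")
notify = ToyTask("notify")  # Extra made-up task to show chaining

fetch >> load >> notify

print([task.task_id for task in fetch.downstream])
print([task.task_id for task in load.downstream])
```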

To end this part, the DAG Python script file is usually stored in a “dags” folder in the Airflow project directory.

Python Scripts

These are the Python scripts used. These are just edited versions of the scripts I used in the DBT pipeline I created in my previous article.

The code for fetching data from CoinMarketCap can be seen below.

# Libraries
from requests import Request, Session
from requests.exceptions import ConnectionError, Timeout, TooManyRedirects
import json
import os

import pandas as pd

def fetch_data():
    # Get data from CoinMarketAPI
    # Storing JSON data from CoinMarketCap to variable. Gets top 100 cryptocurrencies based on market cap.
    url = 'https://pro-api.coinmarketcap.com/v1/cryptocurrency/listings/latest'
    parameters = {
    'start':'1',
    'limit':'100',
    'convert':'USD'
    }
    headers = {
    'Accepts': 'application/json',
    'X-CMC_PRO_API_KEY': '[API KEY]',
    }

    session = Session()
    session.headers.update(headers)

    try:
        response = session.get(url, params=parameters)
        data = json.loads(response.text)
    except (ConnectionError, Timeout, TooManyRedirects) as e:
        print(e)
        raise  # Re-raise so the Airflow task fails instead of continuing without data
    
    # Normalizes JSON data to dataframe
    df = pd.json_normalize(data['data']) 

    # Adds timestamp column to API data pull
    df['timestamp'] = pd.to_datetime('now') 

    ## Saving Dataframe as CSV File
    # Get the directory where the script is located
    script_directory = os.path.dirname(os.path.abspath(__file__))

    # Construct the file path for saving the CSV file in the same directory as the script
    file_path = os.path.join(script_directory, 'CoinMarketCapTop100Crypto.csv')

    # Save the DataFrame to a CSV file
    df.to_csv(file_path, index=False)

    print(f'Data saved to {file_path}')

if __name__ == "__main__":
    fetch_data()
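
The normalization step in the script is what turns nested JSON into flat columns. The sketch below imitates that flattening with plain Python so the behavior is easy to see. `flatten` is a hypothetical helper, the sample record is made up (it is not real API output), and pandas' own `json_normalize` handles more cases than this does.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one level, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Made-up record shaped like one nested API entry.
sample = {"id": 1, "name": "ExampleCoin", "quote": {"USD": {"price": 10.5}}}
print(flatten(sample))
```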

This is the code for loading the data to Postgres.

# Libraries
import csv
import io
import os

import psycopg2
from psycopg2 import sql

def load_data_postgres_docker():
    # Set the working directory to the directory of the script
    os.chdir(os.path.dirname(os.path.abspath(__file__)))

    # Database connection parameters for the PostgreSQL service within Docker
    dbname = 'airflow'  # Database name
    user = 'airflow'    # Username
    password = 'airflow'  # Password
    host = 'airflow_project-postgres-1'   # Container name generated by Docker Compose for the postgres service
    port = '5432'       # PostgreSQL default port

    # Connect to the PostgreSQL database
    conn = psycopg2.connect(dbname=dbname, user=user, password=password, host=host, port=port)

    # Create a cursor object
    cur = conn.cursor()

    # Define the schema and table name
    schema = 'staging'
    table_name = 'coinmarketcap_top_100_crypto'

    # Define the path to the CSV file
    csv_file = 'CoinMarketCapTop100Crypto.csv'

    # Open the CSV file to read the header and determine the column names and data types
    with open(csv_file, 'r') as f:
        # Read the header
        header = next(csv.reader(f))

    # Create the staging schema if it does not exist
    create_schema_sql = f"CREATE SCHEMA IF NOT EXISTS {schema};"
    cur.execute(create_schema_sql)

    # Create the staging table with columns matching the CSV file schema
    create_table_sql = f"CREATE TABLE IF NOT EXISTS {schema}.{table_name} ("
    create_table_sql += ', '.join([f'"id" INT'])
    create_table_sql += ");"
    cur.execute(create_table_sql)

    # Truncate the table
    truncate_sql = f"TRUNCATE TABLE {schema}.{table_name};"
    cur.execute(truncate_sql)

    # Open the CSV file to read only the first column
    with open(csv_file, 'r', newline='') as f:
        reader = csv.reader(f)
        next(reader)  # Skip the header row
        first_column_data = [row[0] for row in reader]

    # Create a string with each value in the first column separated by newline
    first_column_data_str = '\n'.join(first_column_data)

    # Create a file-like object from the CSV data string using io.StringIO
    first_column_data_file = io.StringIO(first_column_data_str)

    # Use psycopg2's copy_expert method to copy data from the first column to the table.
    # The header row was already skipped above, so COPY must not skip another line.
    copy_sql = sql.SQL("""
    COPY {}.{}
    FROM STDIN WITH CSV
    DELIMITER as ','
    """).format(sql.Identifier(schema), sql.Identifier(table_name))
    cur.copy_expert(sql=copy_sql, file=first_column_data_file)

    # Commit the transaction
    conn.commit()

    # Close the cursor and the connection
    cur.close()
    conn.close()

    print('Load successful')

if __name__ == "__main__":
    load_data_postgres_docker()

These are stored in a “scripts” folder inside the “dags” folder in the Airflow project directory, which is why the DAG file imports them from the scripts package.
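
The CSV handling in the load script can be tried in isolation. This sketch repeats the same steps (skip the header, keep the first column, join the values with newlines, wrap them in a file-like object) on a made-up CSV string, with no database involved. The buffer it builds is the kind of object that would be handed to COPY ... FROM STDIN.

```python
import csv
import io

# Made-up CSV content with the same layout idea: "id" is the first column.
csv_text = "id,name,symbol\n1,ExampleCoin,EXC\n2,SampleCoin,SMP\n"

reader = csv.reader(io.StringIO(csv_text))
header = next(reader)  # Consume the header row
first_column = [row[0] for row in reader]

# One value per line, with no header line left in the buffer.
buffer = io.StringIO("\n".join(first_column))

print(header[0])
print(buffer.getvalue().splitlines())
```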

Airflow Web Server

Once all the files are inside the Airflow project directory, we can now run the Docker application. To do this, just run the command below in a terminal in the Airflow project directory.

docker-compose up -d

This will create all the containers for the services we specified in our Docker Compose file. We can see the containers running in the Docker Desktop application, as seen below.

Docker Containers

After creating the containers, we can now access the Airflow webserver to be able to test out our DAG. To do this, just go to http://localhost:8080/ and this should open the Airflow webserver. This is what the Airflow webserver should look like.

Airflow Webserver UI

We should be able to test out the DAG by clicking on it, and the interface below should show up. Just click the “Trigger DAG” to run the DAG once to see if it will work.

fetch_and_load_data DAG

If there are no errors in the DAG run, the latest run in the graph below should be all green.

DAG Run Status

Based on the latest run, there seems to be no error on the DAG, which means all the tasks ran without any error.

Results

To check if the data was loaded into the PostgreSQL database in the container we created, we can use psql, a command-line interface (CLI) tool used to interact with PostgreSQL databases.

To do this, we first need to access the postgres container from the CLI. We start by finding the container ID of our postgres container using the command below.

docker ps

After running the command in the terminal, this should pop up showing the container IDs of every container currently running.

Container IDs of Docker Containers

After getting the container ID for the postgres container, we can access the container and use psql to check the table. To do this, use the command below, replacing the container ID with your own. The -U and -d options refer to the username and database, respectively. We stated in our Docker Compose file that the value for both of these is “airflow”.

docker exec -it 0d1a63e5e816 psql -U airflow -d airflow

After running that command, the terminal should show the psql prompt, as seen below.

Postgres Container

We just have to write the query to check the table that we loaded the data into. The schema and table are created by the load script if they do not already exist. I set up the table to have only one column called “id” with the INT data type, since this is only for demonstration purposes. The schema and table name can be seen in the query below that will be used to check the table.

SELECT * FROM staging.coinmarketcap_top_100_crypto;

Lastly, after running the script, it should show the table with the “id” column having values coming from the CSV file that was fetched in the first task.

coinmarketcap_top_100_crypto Table

We can see in the image above that there are values in the “id” column, which means that the orchestrator worked.

Conclusion

Working with Airflow on Windows has its challenges compared to using a Unix-based system like macOS on a MacBook. However, the experience did provide a valuable opportunity to learn both Docker and Airflow.

Docker allows users to isolate their applications from one another, similar to a virtual machine (VM), but it is a lot more lightweight and simpler to use compared to a VM.

Airflow, on the other hand, is an industry standard when it comes to orchestrators even up to this day. It still has widespread adoption across companies running complex data pipelines. Despite newer and simpler orchestrators entering the scene, Airflow still has strong support due to its reliability and versatility in production environments.

Mastering Docker and Airflow is essential for gaining insight into industry-standard tools. The knowledge not only enhances proficiency in modern data engineering but also prepares practitioners to navigate evolving technology landscapes effectively.

What is MongoDB?

jmregs — Tue, 21 May 2024 05:35:51 GMT

MongoDB is a NoSQL database designed to handle huge amounts of unstructured or semi-structured data. Since it is a NoSQL database, it means that it is schemaless, compared to the rigid schema design found in traditional relational databases. In MongoDB’s case, the data is stored in JSON-like documents. Each document can have different fields, and the data structure can vary from document to document.
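
As a rough picture of what "different fields per document" means, here are two made-up documents represented as plain Python dictionaries. No MongoDB connection is involved; this only sketches the data shape.

```python
import json

# Two made-up documents that could live in the same collection.
# The second has fields the first lacks, and nothing is declared up front.
documents = [
    {"_id": 1, "name": "Ana", "email": "ana@example.com"},
    {"_id": 2, "name": "Ben", "tags": ["admin", "beta"], "address": {"city": "Manila"}},
]

for doc in documents:
    print(json.dumps(doc))

# Fields that appear in only one of the two documents.
only_in_one = set(documents[0]) ^ set(documents[1])
print(sorted(only_in_one))
```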

Why use a NoSQL Database?

The main reason people use a NoSQL database is to take advantage of its schemaless nature. Since the database does not require a rigid schema to be set up before being used, companies can quickly use it to store their unstructured data. This keeps development agile: if a new feature adds a new field to your data, you can store it immediately without changing a schema first. There is no downtime in your development.

This is very useful for companies that have data that are constantly changing like social media companies (Facebook, Twitter, etc.), since they constantly tweak and add new features to their applications.

The other big reason people use NoSQL databases is their scalability. NoSQL databases like MongoDB are designed to scale out through sharding (distributing data across multiple servers). This is crucial for companies that constantly store and add huge amounts of data every day, like Facebook, Instagram, and Netflix.

Using MongoDB

For this, I’ll use a NoSQL database called MongoDB to show how it works. I’ll set up a MongoDB Atlas cluster, and load it with sample data.

Setting up MongoDB Atlas

MongoDB Atlas is the fully managed cloud database service that lets users run MongoDB instances as clusters. It frees users from handling database tasks such as scaling, patching, and others.

To get started, we will first need to create a free account in MongoDB Atlas, and create an organization and project for our database, as seen below. Organizations and projects are just MongoDB Atlas’ way of organizing the users and resources for each task. Organizations are the top level of the resource hierarchy. An organization can contain different projects, and you can freely assign which users have access to which projects.

MongoDB Atlas Organization and Project

In our case, I created a project named “Sample Project” in the organization I made. Click on the project and go to the database section. You should be able to see the screen below. Click the “Build a Cluster” button.

MongoDB Atlas Project UI

It will transfer you to the cluster configuration section, as seen below. MongoDB Atlas has a free cluster configuration setup called M0. We will choose that configuration for our sample project. I named the cluster “SampleProjectCluster”. After that, just click the “Create Deployment” button, and it should start the cluster creation process. This usually takes a few minutes.

Cluster Configuration Setup

After the cluster has been created, we can now load our sample data.

Connecting to the MongoDB Atlas Cluster

After creating and setting up our cluster from earlier, there should be a “Connect” button in the cluster. Click this button.

MongoDB Atlas Cluster Interface

This should come up after clicking the “Connect” button. Click “Drivers” next.

Ways of Connecting to MongoDB Atlas

We should now see the URI connection string. We will use this to connect to the MongoDB Atlas cluster in our code. Each cluster has its own connection string.

MongoDB Atlas Cluster Connection String

After getting the connection string, we then need to create a user account to get access to the cluster in our code. To do this, just click “Quickstart” in the Security section, as seen below. Create a username and password for your user, and click “Create User”. This should create the user credentials that we will use to access the MongoDB Atlas cluster in our code.

User Database Creation

Lastly, we need to allow the IP address of the computer we are using to access the MongoDB Atlas cluster when connecting through our code later. We can do this by just scrolling down on the “Quickstart” section, and we will see the IP Access List section. Just click the “Add My Current IP Address” button to allow your IP address to connect to the MongoDB Atlas cluster.

IP Access List

Once that is done, we can now load sample data to our MongoDB Atlas cluster.

Loading Data to MongoDB Atlas Cluster

There are different ways to load data to MongoDB Atlas, but in our case, we’ll be using PyMongo. This is the official MongoDB driver for Python applications.

We will just be loading a few sample documents to show the capabilities of MongoDB. The code for this can be seen below.

from pymongo import MongoClient

def connect_to_mongodb():

    # MongoDB Atlas connection string (replace <password> with the password of the database user)
    uri = "mongodb+srv://admin:<password>@sampleprojectcluster.ezawixf.mongodb.net/test?retryWrites=true&w=majority"
    
    # Connect to MongoDB Atlas
    client = MongoClient(uri)

    # Setup the database and collection
    db = client["sampleDB"]
    collection = db["sampleCollection"]

    # Create sample data
    sample_data = [
        {"_id": 0, "name": "jim", "score": 5},
        {"_id": 1, "name": "zeke", "score": 10},
        {"_id": 2, "name": "david", "score": 8}
    ]

    # Insert sample data 
    collection.insert_many(sample_data)
    print("Data inserted successfully")

if __name__ == "__main__":
    connect_to_mongodb()

I’ll be explaining the different parts of the code:

MongoDB Atlas connection string = This is the connection string that we got from the MongoDB Atlas cluster earlier. “admin” is the username, and for the password section, you can just input the password you created for the user you created.
Connect to MongoDB Atlas = This is the client connection using the connection string.
Setup the database and collection = The database here is the “sampleDB”, and the collection is “sampleCollection”. Just to quickly explain, a database in MongoDB refers to a container for collections, while a collection is a group of documents. A collection in MongoDB is similar to a table in a relational database.
Create sample data = This part here is a list of dictionaries. Each dictionary becomes a document when we insert the list into the collection later on.
Insert sample data = The insert_many() function is used when there are multiple documents to be inserted in the MongoDB Atlas cluster.

After running the code, the data should now appear in the database and collection we used in our code. This can be seen below.

Each dictionary in the list of our sample data corresponds to a document in a MongoDB collection.

Loading Data with Different Fields

As of now, all the documents in our collection have the same fields, but since MongoDB is schemaless, we can add a document that has different fields without doing any prior setup to the collection.

To show this, I added another document to the collection with additional fields by editing our code earlier. The additional document in the collection can be seen below.

Additional Document with Additional Fields

As you can see, there were no issues with adding a document with different fields. This is one of the key features that developers like about a NoSQL database like MongoDB, since they can easily add different types of data without changing the database first. This is especially useful to companies that constantly change their data, since it allows them to focus their time on developing their product.
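
As a minimal sketch of what such an insert could look like (this is not the exact document from the screenshot), the snippet below builds a document with two extra fields and passes it to insert_one(), the PyMongo method for inserting a single document. The “email” and “badges” fields, their values, and the helper names are assumptions made up for illustration.

```python
# Fields already used by the three documents inserted earlier
EXISTING_FIELDS = {"_id", "name", "score"}

# Hypothetical fourth document: "email" and "badges" are made-up fields
extra_document = {
    "_id": 3,
    "name": "maria",
    "score": 7,
    "email": "maria@example.com",
    "badges": ["early-adopter", "top-10"],
}


def added_fields(document, known_fields=EXISTING_FIELDS):
    """Return the field names that the earlier documents did not have."""
    return sorted(set(document) - set(known_fields))


def insert_extra_document(collection, document):
    """Insert one document as-is; the collection needs no prior setup."""
    result = collection.insert_one(document)
    return result.inserted_id


print(added_fields(extra_document))
```

With the client from the earlier code, calling insert_extra_document(client["sampleDB"]["sampleCollection"], extra_document) would add the document to the same collection as the first three.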

Conclusion

As you can see, a NoSQL database is very useful for people who need a flexible database that can handle different types of unstructured or semi-structured data. It has become a favorite for developers who constantly have changing data, as it allows them to focus less on setting up their database and more on developing their applications.

Aside from that, it also scales really well compared to traditional databases. It can easily manage millions of documents through sharding, making it ideal for applications that handle massive amounts of data daily, such as social media platforms.

However, despite these advantages, a NoSQL database also has its drawbacks. One significant issue is data organization as the database scales. Since NoSQL databases do not enforce a rigid schema, it’s common for documents with different fields to coexist, complicating data organization compared to traditional databases.
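
One way to keep an eye on this is to count how many documents share each set of field names. The snippet below is a small sketch of that idea in plain Python, using the three sample documents from earlier plus one hypothetical document with an extra “email” field. The function name and the fourth document are assumptions for illustration.

```python
from collections import Counter


def field_signatures(documents):
    """Count how many documents share each exact set of field names."""
    return Counter(frozenset(doc.keys()) for doc in documents)


# The three sample documents from earlier plus one hypothetical document
# with an extra field ("email" is made up for illustration)
documents = [
    {"_id": 0, "name": "jim", "score": 5},
    {"_id": 1, "name": "zeke", "score": 10},
    {"_id": 2, "name": "david", "score": 8},
    {"_id": 3, "name": "maria", "score": 7, "email": "maria@example.com"},
]

# Print the most common field combinations first
for fields, count in field_signatures(documents).most_common():
    print(count, sorted(fields))
```

Since PyMongo returns documents as dictionaries, the same function could also be given the result of collection.find() to audit a real collection.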

In conclusion, a NoSQL database has its pros and cons. Determining whether someone needs a NoSQL database compared to a traditional relational database really depends on a person’s use case. In most cases, a relational database is the obvious choice, but in special cases where there is a need to handle large amounts of unstructured data, a NoSQL database might be a better option.

What is MongoDB? was originally published in Dev Genius on Medium, where people are continuing the conversation by highlighting and responding to this story.

Creating an End-to-End Data Pipeline with DBT (Data Build Tool) and CoinMarketCap Data

jmregs — Wed, 01 May 2024 02:26:37 GMT

If you are working in the data industry, chances are that you have heard of DBT (Data Build Tool). It is a tool that handles the transformation part of the ELT (Extract, Load, and Transform) process. But, why exactly is DBT such a popular tool nowadays for data analysts and engineers in handling their pipelines?

What is DBT (Data Build Tool)?

DBT is an SQL-based transformation tool that enables data analysts and engineers to build, test, document, maintain, and monitor data transformation pipelines directly in their data warehouse. Having DBT as an SQL-based transformation tool is its main draw for most data analysts and engineers, since SQL usually has a lower barrier to entry compared to using Python when it comes to transformations. This allows more people within the industry to quickly learn and use the tool.

In addition to this, DBT also has features that borrow best practices from software engineering. Some of these features include:

Version control
Testing and documentation
Modularity

With all these features on top of being SQL-based, DBT quickly became popular in the data industry.

Creating a Pipeline using DBT

After learning what DBT is all about, I will then show a quick project that I did demonstrating how DBT works in a simple data pipeline. I will be using raw data coming from CoinMarketCap’s API.

Data Pipeline Diagram

The components of the pipeline are:

Getting data from CoinMarketCap API
Loading the data to PostgreSQL using Python
Using DBT to transform the data
Using PowerBI to visualize the clean data

I will quickly go through each component of the pipeline to explain the steps for each component.

Getting Data from CoinMarketCap API

For this pipeline, I will be using data from CoinMarketCap, a website that tracks cryptocurrency data. To do this, we first need to create a free account in their dev website (https://coinmarketcap.com/api/) to get an API key for us to use.

After creating an account and logging in, this is what the homepage will look like. The free account allows us 10,000 requests / month, which is more than enough for this simple project.

CoinMarketCap API Website Home Page

I copied the API key, and used it in my Python code that gets the top 100 cryptocurrencies, based on market cap, and saves it locally as a CSV file. The code can be seen below. You can just replace the API key parameter with your API key.

# Libraries
from requests import Request, Session
from requests.exceptions import ConnectionError, Timeout, TooManyRedirects
import json
import os

import pandas as pd

# Run this command in your terminal if Jupyter reports a data rate limit failure when pulling data from the CoinMarketCap API:
# jupyter notebook --NotebookApp.iopub_data_rate_limit=1e10

# Allows user to see all columns in dataframe
pd.set_option('display.max_columns', None)


# Storing JSON data from CoinMarketCap to a variable. Gets top 100 cryptocurrencies based on market cap.
url = 'https://pro-api.coinmarketcap.com/v1/cryptocurrency/listings/latest'
parameters = {
  'start':'1',
  'limit':'100',
  'convert':'USD'
}
headers = {
  'Accepts': 'application/json',
  'X-CMC_PRO_API_KEY': '(Your_API_Key)',
}

session = Session()
session.headers.update(headers)

try:
  response = session.get(url, params=parameters)
  data = json.loads(response.text)
except (ConnectionError, Timeout, TooManyRedirects) as e:
  # Stop here: without a response there is no data to normalize below
  raise SystemExit(e)

# Normalizes JSON data to dataframe
df = pd.json_normalize(data['data']) 

# Adds timestamp column to API data pull
df['timestamp'] = pd.to_datetime('now') 


## Saving Dataframe as CSV File
# Get the current directory
current_directory = os.getcwd()

# Construct the file path for saving the CSV file in the current directory
file_path = os.path.join(current_directory, 'CoinMarketCapTop100Crypto.csv')

# Save the DataFrame to a CSV file
df.to_csv(file_path, index=False)

print(f'Data saved to {file_path}')

After running the code, the file should now be saved in the same folder as your code, and saved as ‘CoinMarketCapTop100Crypto.csv’.
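
The API response is nested JSON, and pd.json_normalize() flattens the nested parts into columns with dotted names such as “quote.USD.price”, which is why those names show up later in the SQL model. The snippet below is a plain Python stand-in that mimics this flattening for nested dictionaries only; it is a sketch of the idea, not pandas' implementation, and the sample record and its values are made up.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dictionaries into one level with dotted keys."""
    flat = {}
    for key, value in record.items():
        column = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dictionaries, carrying the prefix along
            flat.update(flatten(value, column, sep))
        else:
            flat[column] = value
    return flat


# Made-up record shaped like one entry of the API's "data" list
sample_record = {
    "id": 1,
    "name": "Bitcoin",
    "symbol": "BTC",
    "quote": {"USD": {"price": 100.0, "market_cap": 2000.0}},
}

flat = flatten(sample_record)
for column, value in flat.items():
    print(column, "=", value)
```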

Now that we have the file, the next step is to load the raw data to PostgreSQL.

Loading the Data to PostgreSQL

Before trying to load the data, I created a “staging” schema and “prod” schema first in PostgreSQL.

Once we have our schemas ready in PostgreSQL, there are multiple ways of loading data to PostgreSQL, but I ended up using the SQLAlchemy library to load the data.

The code for loading the data can be seen below. The first step in the code is to create a connection to your PostgreSQL server, by inputting the credentials needed. After that, we just need to put our CSV file into a dataframe, and load it to PostgreSQL in the specified schema.

# Libraries
import pandas as pd
from sqlalchemy import create_engine

# Create a connection to the Postgres server
engine = create_engine("postgresql://postgres:admin@127.0.0.1:5432/postgres", echo=False)

# Store CSV file to dataframe
df = pd.read_csv("CoinMarketCapTop100Crypto.csv")

# Load CSV file to postgres
df.to_sql("coinmarketcap_top_100_crypto", con=engine, schema="staging", if_exists="replace", index=False)

After loading the data to PostgreSQL, our raw data should now be in a table on our “staging” schema, as seen below.

Raw Data in the “staging” schema
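
The loading code above has the credentials written directly in the connection string. As a hedged alternative, the sketch below assembles the same kind of URL from environment variables, so the password stays out of the script. The function name is hypothetical, the variable names (PGUSER, PGPASSWORD, and so on) follow the usual PostgreSQL client convention, and the fallback values are the ones used in this project.

```python
import os
from urllib.parse import quote_plus


def build_postgres_url(env=os.environ):
    """Build a SQLAlchemy-style PostgreSQL URL from environment variables."""
    user = env.get("PGUSER", "postgres")
    # Escape special characters such as "@" so they do not break the URL
    password = quote_plus(env.get("PGPASSWORD", ""))
    host = env.get("PGHOST", "127.0.0.1")
    port = env.get("PGPORT", "5432")
    database = env.get("PGDATABASE", "postgres")
    return f"postgresql://{user}:{password}@{host}:{port}/{database}"


# Example with an explicit mapping instead of the real environment
print(build_postgres_url({"PGPASSWORD": "p@ss"}))
```

The returned string could then be passed to create_engine() in place of the hardcoded URL.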

We are now ready to transform / clean the data using DBT.

Using DBT to Transform the Data

I will break this part further down into steps to make it easy to follow.

Installing and Initializing our DBT Project

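To start things off, I installed DBT first using pip. DBT is installed as the dbt-core package plus an adapter for the database you are using, which is dbt-postgres in our case. The command for this can be seen below.

pip install dbt-core dbt-postgres

Once you have DBT installed on your computer, go to your project folder and initialize a DBT project. The command for this can be seen below.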

dbt init demo_dbt

This should create a folder with all the things needed to create our models. Inside that folder, it will have multiple folders, as seen below.

demo_dbt Folder

Setting up profiles.yml File

The first thing that we will do after initializing our project is to create our profiles.yml file. This file contains the connection information to the database we are using for the project, which is PostgreSQL in our case. This is what my profiles.yml looks like after setting it up.

demo_dbt:
  target: prod
  outputs:
    dev:
      type: postgres
      host: 127.0.0.1
      port: 5432
      user: postgres
      pass: admin
      dbname: postgres
      schema: staging
    prod:
      type: postgres
      host: 127.0.0.1
      port: 5432
      user: postgres
      pass: admin
      dbname: postgres
      schema: prod

I will explain each part of the profiles.yml code:

demo_dbt = name of the profile. There can be multiple profiles within a single profiles.yml file. Each profile could be a different connection to different databases. In our case, since we’re only using PostgreSQL, we only have one profile.
target = default target when running our dbt commands. It is currently set to “prod”, which means by default, it will target the “prod” output when we run dbt commands.
dev, prod = these are the 2 output configurations that we have. For each output, you need to put the connection details to the database you are using. In our case, the connection details of the 2 outputs are basically the same, except for the schema part. The “dev” output refers to the “staging” schema, while the “prod” output refers to the “prod” schema.

The profiles.yml file will tell DBT which database to connect to, and which schemas to use for the models we will create later.
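
To make the relationship between “target” and the outputs concrete, the sketch below models the profile as a plain Python dictionary and looks up which output a command would use. This only illustrates the lookup described above and is not how DBT is implemented; the function name is hypothetical and the connection details are trimmed down to the fields that matter here.

```python
# The profile from profiles.yml, reduced to the fields relevant to the lookup
profiles = {
    "demo_dbt": {
        "target": "prod",
        "outputs": {
            "dev": {"type": "postgres", "dbname": "postgres", "schema": "staging"},
            "prod": {"type": "postgres", "dbname": "postgres", "schema": "prod"},
        },
    }
}


def resolve_output(profiles, profile_name, target=None):
    """Return the chosen target name and its output configuration."""
    profile = profiles[profile_name]
    # Fall back to the profile's default target when none is given
    chosen = target or profile["target"]
    return chosen, profile["outputs"][chosen]


print(resolve_output(profiles, "demo_dbt"))         # default target
print(resolve_output(profiles, "demo_dbt", "dev"))  # explicit override
```

On the command line, the override corresponds to the --target flag, for example dbt run --target dev.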

Right now, I have the profiles.yml file in the project folder, but DBT will actually look for it in the default directory. The default directory can be seen below.

C:\Users\(Your username in the computer)\.dbt

There is a way to configure DBT to look for the profiles.yml file in the project folder (the --profiles-dir flag or the DBT_PROFILES_DIR environment variable), but in our case, I just decided to put a copy of the profiles.yml file in the default DBT directory.

Creating the DBT Model

Before we create our DBT model, I’ll first explain what a DBT model is. A DBT model is basically just an SQL script that defines transformations, aggregations, or calculations. The resulting output can then be stored as a table, a view, or a temporary table. DBT models are usually stored in the “models” directory in our project folder.

For the DBT model to clean the raw data and load it into our “prod” schema, the SQL script can be seen below. The config setup at the start of the code is basically telling DBT to store the resulting output as a table. As for the rest of the SQL script, the transformations are:

Only selecting certain columns
Limiting the data to the top 50 cryptocurrencies based on market cap only
Changing data type of dates from “text” to “timestamp”
Renaming columns

{{ config(materialized='table') }}

SELECT 
 id
 , name
 , symbol
 , num_market_pairs
 , CAST(date_added AS timestamp) AS date_added
 , max_supply
 , circulating_supply
 , total_supply
 , infinite_supply
 , CAST(last_updated AS timestamp) AS last_updated
 , "quote.USD.price" AS USD_price
 , "quote.USD.volume_24h" AS USD_volume_24h
 , "quote.USD.volume_change_24h" AS USD_volume_change_24h
 , "quote.USD.percent_change_24h" AS USD_percent_change_24h
 , "quote.USD.percent_change_7d" AS USD_percent_change_7d
 , "quote.USD.percent_change_30d" AS USD_percent_change_30d
 , "quote.USD.market_cap" AS USD_market_cap
 , "quote.USD.market_cap_dominance" AS USD_market_cap_dominance
 , "quote.USD.fully_diluted_market_cap" AS USD_fully_diluted_market_cap
 , "platform.id" AS platform_id
 , "platform.name" AS platform_name
 , "platform.symbol" AS platform_symbol
 , "platform.token_address" AS platform_token_address
 , CAST("timestamp" AS timestamp) AS timestamp
FROM staging.coinmarketcap_top_100_crypto
ORDER BY "quote.USD.market_cap" DESC
LIMIT 50

The SQL file should be stored in the “models” directory within our project folder, as seen below. In our case, I created another folder within the “models” folder called “coinmarketcap_model”, and stored the SQL file there.

Testing the Model

After creating the model, it is usually advisable to create a schema.yml file inside your model folder. It acts as documentation for your model. The schema.yml file for our model can be seen below.

version: 2

models:
  - name: coinmarketcap_clean
    description: "Cleans the data and transfers it to prod."
    columns:
      - name: id
        description: "The primary key for this table"
        tests:
          - unique
          - not_null
    meta:
      schema: prod  # Records the intended production schema for this model (metadata only)

I’ll explain each part:

name = this refers to the model name.
description = description of model.
columns = name of each column. You can put a description of each column here, so it can act as a data dictionary. In our case, I only put the “id” column for our tests.
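tests = checks that DBT runs as SQL queries against the table that a model has built, so the model needs to exist in the database (through dbt run) for its tests to pass. I used 2 tests that are built into DBT, unique and not_null, for the “id” column. You can build your own customized tests, if needed.
meta = free-form metadata attached to the model. Here it records that the model belongs to the “prod” schema. The schema that the model is actually built in comes from the target in profiles.yml.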
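
As a rough picture of what the two built-in tests check, the sketch below applies the same rules to a plain Python list of “id” values. In DBT, each test is a query that returns the failing rows, and the test passes when nothing comes back. The function names and sample values are assumptions for illustration.

```python
from collections import Counter


def not_null_failures(values):
    """Return the entries that are missing (None), like the not_null test."""
    return [value for value in values if value is None]


def unique_failures(values):
    """Return the values that appear more than once, ignoring missing ones."""
    counts = Counter(value for value in values if value is not None)
    return [value for value, count in counts.items() if count > 1]


good_ids = [1, 1027, 825, 1839]      # made-up ids: unique and not null
bad_ids = [1, 1027, 1027, None]      # one duplicate and one missing value

print(unique_failures(good_ids), not_null_failures(good_ids))
print(unique_failures(bad_ids), not_null_failures(bad_ids))
```

A test passes when its list of failures is empty, which is the case for the first list but not for the second.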

Since there are tests in the schema.yml file, you can run them with the command below. This will run all the tests within your project folder, if you have multiple models and tests. Since we only have one model, it will only run the tests for that model.

dbt test

After running the tests, DBT will show whether there are any errors. In our case, the tests show that all the values of the “id” column are unique and not null.

DBT Model Test Results

Running the Model

After testing our model, we can now run the actual model. This will transform the raw data from our “staging” schema, and load this into our “prod” schema.

To run our model, simply run the command below. This will run all the models in your project folder. Since we only have one model in our project folder, it will only run that model.

dbt run

This is what it should look like if there are no errors after running the model.

DBT Model Run

The transformed data should show up in the “prod” schema, as seen below.

Transformed Data in “prod” schema

Now that we have the clean data, we can now do a simple dashboard to visualize our data as the final step.

Using PowerBI to Visualize the Clean Data

For this step, I connected PowerBI to PostgreSQL to be able to use the clean data. After creating and designing the dashboard, the final output can be seen below.

CoinMarketCap PowerBI Dashboard

Any data visualization tool can work for this step. This will depend on your preference.

Conclusion

In conclusion, DBT proved to be an efficient tool for data transformations. The hardest part of the project was actually getting the raw data and loading it to the data warehouse. Once the raw data was loaded, the transformation could be done in the data warehouse itself using DBT. In addition to being able to use SQL directly for the transformations, DBT also has the added features of version control, testing, and modularity, which help users make sure that their models are working. With the transformed data readily available, users have the flexibility to perform ad hoc queries, visualize insights, or integrate the cleaned data with other datasets as needed. By streamlining the transformation process, DBT lets users build their data pipelines more efficiently.

Here is the GitHub repository link for the project if you want to check it out: https://github.com/jmreguyal/coinmarketcap_dbt_data_pipeline

Creating a Basic ETL Pipeline with AWS S3 and Glue

jmregs — Mon, 22 Apr 2024 02:30:46 GMT

I have been studying Amazon Web Services (AWS) for a while now, but I have not actually used it yet in any personal project as of now. To bridge this gap, I decided to create a project that will allow me to use two commonly used tools in AWS to create a simple ETL (Extract, Transform, and Load) pipeline.

The two tools are:

AWS S3 (Simple Storage Service)
AWS Glue

Project Diagram

The project will make use of AWS S3 to store the file we will use to transform using AWS Glue. After doing the transformation, the clean data will be sent and loaded back to AWS S3.

I will explain the steps I did in each tool to create the ETL pipeline project.

Creating User with Policies

Before we get started, I first created an IAM user with the necessary policies needed to create this project.

The policies used are:

AmazonS3FullAccess
AWSGlueConsoleFullAccess

Policies Needed for the Project

Once the policies are set up for the IAM user to be used, we can now start with the project.

AWS S3

I will be using a Netflix dataset that I got from Kaggle. Here is the link if you want to see or use it for your own project: Netflix Movies and TV Shows (kaggle.com)

After getting the dataset, I created a bucket with the name ‘bucket-data-pipeline’, as seen below.

S3 Bucket Used in the Project

After that, I created a folder inside the bucket named “data”.

“data” Folder

Inside the “data” folder, I created two more folders named “clean” and “raw”. There are two folders, since the CSV file will first be put in the “raw” folder, and after doing some transformations, the clean version of the file will be placed in the “clean” folder.

“clean” and “raw” Folders

Finally, I uploaded the ‘netflix_titles.csv’ file into the “raw” folder. I just used the default settings when I uploaded the CSV file.

CSV File

AWS Glue

Now that we have the file in our S3 bucket, it’s time to use AWS Glue to be able to transform our data, and to load the clean data back to S3.

The first step that we will do is to create a database in AWS Glue. This is where we’ll store the table that will contain the metadata of our Netflix CSV file.

Netflix Database

After that, we then need to create a crawler. This will “crawl” through the S3 bucket that we specify in order to get the metadata and store it in the Glue data catalog. We will use this metadata later during the transformation process.

Netflix Data Crawler

Make sure to associate an IAM role to the crawler to give it access to the S3 bucket, and to specify the database we created earlier. The crawler will create a table inside the database containing the metadata of our file.

IAM Role Associated

After running the crawler, there should now be a table containing the metadata of the file we created.

Netflix Data Metadata

We can now create a Glue notebook that will handle the transformation, and the loading of the clean file back to our S3 bucket. The code that I used can be seen below.

# Run this cell to set up and start your interactive session.
%idle_timeout 2880
%glue_version 4.0
%worker_type G.1X
%number_of_workers 5

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql.functions import to_date

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

# Use Spark's legacy date parser for the to_date call below.
# This is set only after the Spark session has been created.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

# Create a DynamicFrame from the Netflix table in the AWS Glue Data Catalog and display its schema
dyf = glueContext.create_dynamic_frame.from_catalog(database='netflix_data', table_name='netflix_titles_csv')
dyf.printSchema()

# Convert the DynamicFrame to a Spark DataFrame and display a sample of the data
df = dyf.toDF()
df.show()

## Transformation
# Transform date_added column to datetime
df = df.withColumn("date_added", to_date("date_added", "MMMM dd, yyyy"))


# Save Spark Dataframe as CSV file in S3 Bucket
# Specify the S3 path and filename where you want to save the DataFrame as CSV
s3_output_path = "s3://bucket-data-pipeline/data/clean"

# Write DataFrame to CSV format in S3 with the specified filename
df.write.mode("overwrite").csv(s3_output_path, header=True)

I got the data by referencing the database, and the table we created earlier in our Glue data catalog.

For the transformation part, I only changed the data type of the “date_added” column. After transforming the data, the code will also store the clean file back to the “clean” folder in our S3 bucket.
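
To see what this conversion does to a single value, the sketch below parses one date string in plain Python. The Spark pattern "MMMM dd, yyyy" roughly corresponds to "%B %d, %Y" in Python's strptime. The sample value is an assumed example in the format implied by the pattern, and returning None for values that do not match is a choice made for this sketch.

```python
from datetime import datetime


def parse_date_added(value):
    """Parse strings such as "September 25, 2021"; return None if they do not match."""
    try:
        return datetime.strptime(value.strip(), "%B %d, %Y").date()
    except (AttributeError, ValueError):
        # AttributeError covers missing values (None), ValueError bad formats
        return None


print(parse_date_added("September 25, 2021"))  # 2021-09-25
print(parse_date_added("not a date"))          # None
```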

Once the notebook is created, it can then be run as a job to transform the data and load the clean file to the S3 bucket. I only ran the job once, but the job can be scheduled to run at specific times.

Notebook Job

After running the job, there should now be a CSV file containing our transformed data in the “clean” folder in our S3 bucket, as seen below. The name of the clean file looks like that because Spark writes its output as part files with generated names. An improvement would be to save the clean data with something other than the Spark dataframe writer, to be able to choose the file name.

Transformed Data in “clean” Folder

Once the clean file is back in our S3 bucket, you can now do whatever you want with the file, such as loading it into an RDBMS, querying it directly in S3, or moving the file to another storage service.

Conclusion

To summarize, here are the things that I have done in the project:

Get a dataset (CSV file) from Kaggle to use for the project.
Create an S3 bucket.
Store the raw CSV file to the S3 bucket.
Create a Glue notebook to transform the raw CSV file.
Load the clean CSV file back to the S3 bucket.

Moving forward, there are still a lot of things that can be done to further improve the project. Here are some of the ideas I have for future improvements:

Query the clean data using AWS Redshift or other RDBMS.
Store multiple CSV files in the S3 bucket, and try to transform and combine these files to one or multiple tables.
Use an orchestrator to automatically schedule getting file extracts, and running the jobs.

As you can see, there are numerous opportunities for enhancing this basic ETL pipeline further. This pipeline serves as a foundational reference for learning the fundamentals of two widely-used AWS tools: AWS S3 and AWS Glue.