Postgres Redux: File Prep and Data Ingestion

Dan Watson
Hardwood Convergence
5 min read · Mar 22, 2020
Photo by 🇨🇭 Claudio Schwarz | @purzlbaum on Unsplash

We’re back up and running. If you missed the last post, check out how we used the nba_api python package to download NBA data. Now we’re going to create a new database to store our files. We usually go through step-by-step detail in these posts, but we’ll be a bit briefer today since we already have two posts that detail this process. This post won’t teach a ton of new material; it’s just to level set before we continue.

The Plan

  • Create a new postgres database locally
  • Consolidate our files for our initial upload
  • Script our initial tables and load data into our database

Creating a Local Postgres Database

If you’ve been following along, you’ve already created a Postgres database before. If you haven’t, check out this post for step-by-step instructions on installing Postgres and creating a database.

Using PgAdmin, I created a new database called nba_db. You can see that we only have our public schema available and I haven’t created any tables yet.

A blank database called nba_db

Now let’s plan to make our database a bit more organized and we’ll create a schema. A schema is just a namespace in a database. It allows us to organize our tables and we can easily add security and rules around users and schemas. We’ll start by just making a “raw” schema. This will be the location where we load our raw data. Later, we’ll transform our raw data and put it into a new schema where users can access it.
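As an aside, if you prefer SQL to pointing and clicking, the same thing is a one-liner in the Query Tool. The GRANT line is just a sketch of the security point above and assumes a hypothetical analyst role:

CREATE SCHEMA IF NOT EXISTS raw;
-- schema-level security sketch; the analyst role here is hypothetical
GRANT USAGE ON SCHEMA raw TO analyst;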

Making a schema is easy. We’ll right-click on “Schemas”, then choose “Create” and then “Schema…”.

Creating a new schema

Next we’ll name this schema “raw” and then click save.

Naming our schema

Now we should see the new schema in our sidebar. If not, just right-click your database and refresh it, and the schema will show up. Congrats, you have a new schema.

Our raw schema exists!

Consolidating Files For Ingestion

There are a couple of ways we can upload the data from our csvs into our database. We can write a script that loads every file individually into the database, or we can combine the files for each table and then load a single initial file per table. I like the latter option, but it doesn’t matter much. I wrote about this before, so check out this post if you need a refresher. Otherwise, our script will look something like this for each table:

#player_bs_trad: grab the header row from one file, then append every file's data rows
head -1 /Users/Dan/Desktop/nba_project/base_data/player_bs_trad/player_bs_trad_2000-10-31.csv > /Users/Dan/Desktop/medium_tutorials/base_data/db_init_data/player_bs_trad_init.csv
for file in /Users/Dan/Desktop/nba_project/base_data/player_bs_trad/*.csv; do
    sed 1d "$file" >> /Users/Dan/Desktop/medium_tutorials/base_data/db_init_data/player_bs_trad_init.csv
done

Once that script is set up to match your file structure, go back to your terminal and navigate to your merge_script directory. Then run the following:

Running the merge_data.sh script
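For those who’d rather copy and paste, that’s something like the following; the merge_script path assumes my folder layout, so adjust it to yours:

cd /Users/Dan/Desktop/medium_tutorials/merge_script   # path assumes my folder layout
sh merge_data.sh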

Loading Data to the Database

So we have a database and we have csvs for each of our initial tables. All that’s left is to write a script that tells the database to create tables and load the data in the correct format. We’ve done this before in a previous post, so you can check out the details here.

The file we’ll use is the create_tables.sql file which is located in our db_scripts folder. For each table, we’ll use the following process:

DROP TABLE IF EXISTS schema.table_name;
CREATE TABLE IF NOT EXISTS schema.table_name(
column1_name column1_type,
column2_name column2_type,
...
columnN_name columnN_type);
COPY schema.table_name FROM '/Users/Dan/Desktop/medium_tutorials/base_data/db_init_data/filename.csv' DELIMITER ',' CSV HEADER;

That should be somewhat straightforward, so now follow that format for all of our tables. Below is a sample for our player_bs table:

DROP TABLE IF EXISTS raw.player_bs;
CREATE TABLE IF NOT EXISTS raw.player_bs(
game_id TEXT,
team_id TEXT,
team_abbr TEXT,
team_city TEXT,
player_id TEXT,
player_name TEXT,
start_position TEXT,
comment TEXT,
mins TEXT,
fgm NUMERIC,
fga NUMERIC,
fg_pct NUMERIC,
fg3m NUMERIC,
fg3a NUMERIC,
fg3_pct NUMERIC,
ftm NUMERIC,
fta NUMERIC,
ft_pct NUMERIC,
oreb NUMERIC,
dreb NUMERIC,
reb NUMERIC,
ast NUMERIC,
stl NUMERIC,
blk NUMERIC,
tov NUMERIC,
pf NUMERIC,
pts NUMERIC,
plus_minus NUMERIC,
file_name TEXT
);
COPY raw.player_bs FROM '/Users/Dan/Desktop/medium_tutorials/base_data/db_init_data/player_bs_trad_init.csv' DELIMITER ',' CSV HEADER;

Before you write your file, one note: I checked the underlying csvs and noticed an issue with almost all of the team_bs_summary files, where the data doesn’t fit the expected formatting. We therefore won’t load those files into our database.
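If you want to check your own csvs for the same kind of problem, a quick field-count check helps. This is just a sketch: it assumes comma-delimited files with no embedded commas in quoted fields, and the team_bs_summary_init.csv name is my guess at what your merge script produces, so swap in whichever file you want to inspect:

awk -F',' '{ print NF }' team_bs_summary_init.csv | sort -n | uniq -c
# a well-formed file prints a single field count; multiple counts mean malformed rows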

Now that we have our finished create_tables.sql file, we just need to connect to Postgres and run the file. We’ll do this in the SQL Shell (psql) that comes with the Postgres install.
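Concretely, the run is something like the following inside psql; the db_scripts path assumes my folder layout, so adjust it to yours:

\c nba_db
\i /Users/Dan/Desktop/medium_tutorials/db_scripts/create_tables.sql

If everything runs correctly, it should look something like this: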

Data uploaded to our database

We can now go back to PgAdmin and refresh our database. Then we can go to our raw schema and check out the tables and our data:

Success!
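If you’d rather spot check with a query, something like this against raw.player_bs does the job (the counts will obviously depend on your data):

SELECT COUNT(*) FROM raw.player_bs;
SELECT * FROM raw.player_bs LIMIT 5;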

Wrapping Up

We made a lot of progress today. We started with a few thousand csv files, aggregated them, and uploaded them into a database where we can access them as a single dataset with minimal effort. As always, the code is available here on our github so you can follow along.

Next we’ll check out the data in more detail for validation and then we’ll transform the data to make it more user friendly. After that we’ll start performing analyses and creating dashboards, apps, and websites. Please leave any questions, issues, or improvements in the comments. If you enjoyed today’s post, please share with a friend and smash the clap button. 👏🏼👏🏼👏🏼

Dan Watson
Hardwood Convergence

Data nerd, basketball fanatic, and ice cream connoisseur. Health care analytics by day, basketball analytics by night. https://www.linkedin.com/in/danielkwatson