Prepping for Postgres: Installation and File Preparation Using Bash

Dan Watson
Hardwood Convergence
7 min read · Sep 15, 2019

Hey y’all, sorry it’s been a while. I’ve been traveling a bunch since the last post, but now we’re back at it. Last post, we downloaded a bunch of NBA data and now we’re going to start the process of making it usable. In this post, we’re going to start prepping the files to load them into a database. If you need the files, grab them here. Here’s our plan:

The Plan

  • Introduce and install PostgreSQL
  • Prep our gamelog files for import

Introducing PostgreSQL

Postgres is a free, open-source relational database. It’s fully featured, supporting complex queries, sub-selects, foreign keys, triggers, views, functions, and, starting with PostgreSQL 11, stored procedures. We won’t need most of this starting out, but it may come in handy down the road. One quick aside: a lot of analytics projects use MongoDB, which is a NoSQL data store. MongoDB is great and fast, but since we have structured, tabular data, I prefer PostgreSQL. If there’s interest in NoSQL options, let me know and we can create some tutorials around those as well.

Installing Postgres

The Postgres installation is super easy. First, head to the Postgres download site and choose the correct installer for your machine. I’m on a Mac, so I’ll be using the 11.5 download. Once it’s downloaded, run the installer and go through the installation process:

PostgreSQL Installer

After clicking “Next”, you’ll see options to install PostgreSQL Server, pgAdmin 4, Stack Builder, and Command Line Tools. We’ll install all of these components.

Now just continue through the installation process, selecting the default options if any come up regarding ports, data directories, etc. If this is your first time installing Postgres, it may ask for a database superuser password. The superuser can bypass all security checks, so make sure this is a secure password that only you know!

After installation, it will ask if we want to launch Stack Builder at exit. Just deselect it, click Finish, and we’ll be on our way.
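One quick aside for fellow Mac users: if you prefer the command line, Homebrew offers a roughly equivalent setup. This is just a sketch of an alternative route; the rest of this post assumes the graphical installer above.

# Install the Postgres 11 formula and run it as a background service
brew install postgresql@11
brew services start postgresql@11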

Now that we have Postgres installed, you can just go to your launcher and click pgAdmin. When it opens, you may need to enter your superuser password. Once that’s complete, you’ll see a screen like this:

pgAdmin start page

If we click the arrow next to Servers in the left menu, we’ll get a dropdown that shows the versions of PostgreSQL available on our machine. You’ll probably see just PostgreSQL 11, but I have some older versions installed as well. Either way, right-click on PostgreSQL 11 and select Create > Database.

Creating our nba database

The first thing we need to do is name our database. I’ll use nba_db, and we’ll keep the owner as postgres. We’re going to leave the options on the other tabs as defaults for now, since we’re just creating a local database. One thing to notice: under the Definition tab there’s an option for Connection Limit, set to -1. That allows a theoretically unlimited number of connections to the database. You may want to lower this value in future instances where many outside users connect to the database, since each connection consumes system resources and performance erodes as they add up. Anyway, click Save. Congrats, you’ve created a database! Now just right-click on PostgreSQL 11 again and click Refresh. You should see something like this:

Our new database exists!

You can see that nba_db is greyed out. That just means we’re not currently connected to the database, but it exists! You can click on it and check it out, though right now the database is empty. Just to get you oriented on where we’ll find our tables: if you click Schemas > public > Tables, that’s where our data will live once it’s imported. Since we’re making a simple database, we’ll just use the single public schema for now, but typically you can use multiple schemas to separate access to different tables across user privileges. For example, we could have a “raw” schema storing raw data that only administrators can access, alongside public tables that our end users can query. Anyway, we have a database; now let’s prep our data.
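Two quick command line asides before we move on. First, everything we just did through pgAdmin can also be done with the command line tools we installed earlier. A minimal sketch, assuming createdb and psql are on your PATH:

# Create nba_db owned by the postgres superuser; you'll be prompted
# for the superuser password you set during installation
createdb -U postgres -O postgres nba_db

# List all databases to confirm nba_db exists
psql -U postgres -l

Second, to make that raw-versus-public idea concrete, here’s a hypothetical sketch. The raw schema and analyst role are illustrative names, not something we’ll actually build in this series:

# A schema regular users can't see into, plus a limited role
# that can only work with the public schema
psql -U postgres -d nba_db -c "CREATE SCHEMA raw;"
psql -U postgres -d nba_db -c "REVOKE ALL ON SCHEMA raw FROM PUBLIC;"
psql -U postgres -d nba_db -c "CREATE ROLE analyst LOGIN PASSWORD 'change_me';"
psql -U postgres -d nba_db -c "GRANT USAGE ON SCHEMA public TO analyst;"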

Prepping Game Logs for Import

As you remember, we downloaded all the game logs from the 2012–13 NBA season to the present. All this data sits in 14 different folders inside our base_data folder, and each of those folders contains over 7,000 individual files. We’re obviously not going to load each file into the database manually, so our plan is to combine the contents of each folder into a single file, leaving us with 14 combined files. We could do this with a Python loop, but it’s quicker and simpler to write a short Bash script that leans on standard Unix tools. Fair warning: I’m not the best at Bash, so if you have a cleaner way to do this, please share!

Here’s our strategy:

  • For each of the 14 folders, pick one specific file, take its column headers, and write them to a new file.
  • Loop through all the files in the folder and copy every row except the header.
  • Append the copied rows to the new file we just created.

Before we start writing this script, let’s create some folders to keep everything organized. Inside our base_data folder, I’ll create a db_init_data folder and a merge_script folder. We’ll store the Bash script in merge_script, and the 14 files we create will go into db_init_data. That’s enough setup; open up your text editor, create a file called merge_data.sh, and save it in the merge_script folder.
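If you’d rather make those folders from a terminal, from inside base_data it’s just:

# -p creates both folders and is a no-op if they already exist
mkdir -p db_init_data merge_script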

Let’s start with the gl_player folder. The first file that comes up for me is gl_player_0021200001.csv, so we’ll use that to get our column names. We’ll use the Bash head command followed by a dash and the number of lines we want to take, then redirect that output with > to indicate where we’re putting it. We’ll put it in the db_init_data folder and name the new file gl_player_init.csv. Put it together and your first line will look like this; just update the file path to match your machine:

head -1 /Users/Dan/Desktop/medium_tutorials/base_data/gl_player/gl_player_0021200001.csv > /Users/Dan/Desktop/medium_tutorials/base_data/db_init_data/gl_player_init.csv

Now we write our loop. A Bash for loop takes the form for <variable> in <list>. Ours will look like this:

for file in /Users/Dan/Desktop/medium_tutorials/base_data/gl_player/*.csv;

The *.csv is a glob pattern: it matches every file in the gl_player folder whose name ends in .csv, and those are the files we’ll loop through. Next we tell the loop what to do with each file. We open the loop body with do, then use sed, a stream editor for filtering and transforming text. The basic format we’ll use is this:

do sed <# of rows to skip>d $<file variable> >> <output file>

Therefore, our next line is:

do sed 1d "$file" >> /Users/Dan/Desktop/medium_tutorials/base_data/db_init_data/gl_player_init.csv;
done

Finally, we add “done” at the end of the script to close the loop. (We also quote $file so the command stays safe if a filename ever contains a space.) The full script for this folder is then:

head -1 /Users/Dan/Desktop/medium_tutorials/base_data/gl_player/gl_player_0021200001.csv > /Users/Dan/Desktop/medium_tutorials/base_data/db_init_data/gl_player_init.csv
for file in /Users/Dan/Desktop/medium_tutorials/base_data/gl_player/*.csv;
do sed 1d "$file" >> /Users/Dan/Desktop/medium_tutorials/base_data/db_init_data/gl_player_init.csv;
done
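Before running that over thousands of files, you can convince yourself what sed 1d does with a quick one-liner in any terminal:

# sed 1d deletes line 1 and passes everything else through
printf 'header\nrow1\nrow2\n' | sed 1d
# prints:
# row1
# row2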

Let’s test this single command to make sure everything is working correctly before we expand it to the remaining 13 folders. Open up a terminal, navigate to the merge_script folder, and run:

bash merge_data.sh
Running our bash command

On my machine, it takes about 30 seconds to run, and you’ll see a new command line prompt when it finishes. Then go to your db_init_data folder; if everything ran correctly, you’ll see:

Our singular gl_player_init csv!
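If you want a sanity check beyond eyeballing the folder, the merged file’s line count should equal the total lines across the source files, minus one header per file, plus the single header we kept. Assuming the same paths as above:

# Lines in the merged file
wc -l < /Users/Dan/Desktop/medium_tutorials/base_data/db_init_data/gl_player_init.csv

# Total lines across the source files, and the number of source files;
# the merged count should be (total lines) - (file count) + 1
cat /Users/Dan/Desktop/medium_tutorials/base_data/gl_player/*.csv | wc -l
ls /Users/Dan/Desktop/medium_tutorials/base_data/gl_player/*.csv | wc -l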

Perfect, that gives us one file that we can load into our database for our future gl_player table! Let’s delete this file and expand our Bash script to cover all the folders. There’s probably a better way to do this, but I’m just going to replicate the block we have for each of the 14 folders. I won’t screenshot it here; just grab the file from the GitHub repo and update your folder paths. Once you run it, your db_init_data folder should contain your 14 files of base data.
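If copy-pasting 14 near-identical blocks bothers you, here’s one more compact sketch using an outer loop. The folder names other than gl_player are placeholders, so substitute the 14 actual folder names from your base_data:

#!/bin/bash
BASE=/Users/Dan/Desktop/medium_tutorials/base_data

# Replace this list with the 14 folder names under base_data
for folder in gl_player gl_team; do
    out="$BASE/db_init_data/${folder}_init.csv"

    # Write the header row from the first CSV in the folder...
    first=$(ls "$BASE/$folder"/*.csv | head -1)
    head -1 "$first" > "$out"

    # ...then append every file's data rows, skipping each header
    for file in "$BASE/$folder"/*.csv; do
        sed 1d "$file" >> "$out"
    done
done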

Wrapping Up

Now we have a database and some base files to load into it. This post is getting a bit long, so we’ll wrap it up here, and in the next post we’ll load the data into Postgres. Then we’ll have a repository from which we can quickly pull a ton of stats, perform analyses, and build applications.

As always, the files are available on the github. Let me know if you have any questions and I’ll have the next post up within the next day so we can really start moving forward with this data. Smash that clap button to let me know you were here and talk soon.
