Set Up For Success: Creating Tables and Loading Data in Postgres

Dan Watson
Hardwood Convergence
6 min read · Sep 16, 2019

Last post, we discussed how to create a database and simplified our initial data. Now we'll write the scripts to insert our data into our database.

The Plan

  • High-level review of Postgres data types
  • Write scripts to create tables
  • Run scripts in the psql shell to load data into our tables

Postgres Data Types

I’m sure you’ve noticed by looking at our game log data that we’re working with many different types of data. We have text data for player names and teams, integers for points and rebounds, time for minutes played, and rational numbers for shooting percentages. When creating our database, we need to be cognizant that we’re choosing the right data type for each column we’re importing to ensure the data loads, displays, and acts properly in our database.

If you check out the official documentation, you'll see a ton of available data types. Just for numbers, there are int, smallint, bigint, float, real, and numeric types. It can seem complicated, so we're going to follow some basic advice from guru99:

  • Use “text” unless you want to limit the input
  • Never use “char”
  • For integers, use “int”; use “bigint” only when you have really big numbers
  • Use “numeric” almost always
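
To make those rules concrete, here's a quick throwaway example (the table and columns are just illustrative, not part of our database):

CREATE TABLE type_examples(
    player_name TEXT,        -- text: no reason to limit the input
    pts INT,                 -- int: everyday counting numbers
    attendance_total BIGINT, -- bigint: reserved for really big numbers
    fg_pct NUMERIC           -- numeric: anything that might have decimals
);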

Based on these simple rules, we're going to start writing the SQL code to create our tables.

Creating Tables and Loading Data in PostgreSQL

So the first question is probably "Why are we writing scripts to load data into our database instead of using the pgAdmin GUI?" My best answer is: we're probably going to screw up and will want to recreate these tables. Scripting the table creation gives us control from the beginning and lets us quickly make changes if we notice errors when testing our database.

The first table we'll create is the player game log table. Keeping our naming convention, the table will be called gl_player. As you'd expect, we'll start by opening a text editor and creating a new file. We'll call this file create_tables.sql and save it in a new folder we'll call db_scripts.

Now we can start creating the table. The first line we write is going to be somewhat counterintuitive:

DROP TABLE IF EXISTS gl_player;

This command is going to drop any table from our database that is named gl_player. If gl_player does not exist, this command will be ignored. We write the script this way so we can recreate tables in our database quickly without worrying about getting errors regarding the table already existing.

Now that we’re certain the table will not exist, we can write the script to create the table. Creating a table will take this basic format:

CREATE TABLE IF NOT EXISTS <table_name>(
    <column name> <data type>,
    <column name> <data type>
);

For our gl_player table, we'll run through the columns and the data types we'll use:

game_id: TEXT
team_id: TEXT
team_city: TEXT
nba_id: TEXT
player_name: TEXT
comment: TEXT
mins: TEXT
fgm: NUMERIC
fga: NUMERIC
fg_pct: NUMERIC
fg3m: NUMERIC
fg3a: NUMERIC
fg3_pct: NUMERIC
ftm: NUMERIC
fta: NUMERIC
ft_pct: NUMERIC
oreb: NUMERIC
dreb: NUMERIC
reb: NUMERIC
ast: NUMERIC
stl: NUMERIC
blk: NUMERIC
tov: NUMERIC
pf: NUMERIC
pts: NUMERIC
plus_minus: NUMERIC

You're probably wondering why we're using the NUMERIC data type for all our stats instead of integers. If you open the csv in Excel or Mac Numbers, you'll see an fga of 3. However, look at how the data is actually stored when we open the csv in a text editor. It's written as 3.0, and if we use an INT data type, we'll get a syntax error when we try to load the data.

Our integer data are stored with decimals! Use NUMERIC!
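
If you want to see that failure for yourself, you don't even need to load a file. COPY parses each field with the same input rules as a cast, so a quick test in any psql session shows the difference:

SELECT '3.0'::NUMERIC;  -- works: returns 3.0
SELECT '3.0'::INT;      -- fails: invalid input syntax for type integer: "3.0"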

So now that we know how we'll fill in our CREATE TABLE command, we just need to tell Postgres how to import our data into the table. For that we'll use the following snippet:

COPY <table_name> FROM <file path> DELIMITER ',' CSV HEADER;

This code is pretty straightforward, but we'll run through it just to make sure we're on the same page. It tells the database to copy data into our table from the file at the given path, and it gives Postgres a heads up that the file is a comma-delimited csv with a header row. Putting it all together, our file looks like this:

Our create table script for gl_player
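
If you're following along in your own editor, the whole file boils down to the three pieces we just covered. Here's a sketch of create_tables.sql, with a placeholder path and csv name you'd swap for wherever you saved the game log data:

DROP TABLE IF EXISTS gl_player;

CREATE TABLE IF NOT EXISTS gl_player(
    game_id TEXT,
    team_id TEXT,
    team_city TEXT,
    nba_id TEXT,
    player_name TEXT,
    comment TEXT,
    mins TEXT,
    fgm NUMERIC,
    fga NUMERIC,
    fg_pct NUMERIC,
    fg3m NUMERIC,
    fg3a NUMERIC,
    fg3_pct NUMERIC,
    ftm NUMERIC,
    fta NUMERIC,
    ft_pct NUMERIC,
    oreb NUMERIC,
    dreb NUMERIC,
    reb NUMERIC,
    ast NUMERIC,
    stl NUMERIC,
    blk NUMERIC,
    tov NUMERIC,
    pf NUMERIC,
    pts NUMERIC,
    plus_minus NUMERIC
);

-- The path and file name below are placeholders; point this at your own csv
COPY gl_player FROM 'C:/db_scripts/gl_player.csv' DELIMITER ',' CSV HEADER;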

Running Scripts in SQL Shell

To run this script, open the SQL Shell and log in to your server. At the prompt, you'll first be asked to enter the server address. Next to Server, you'll see [localhost]; this is the default value, so you can hit enter if you've been following along and building this database locally. Next, enter the name of the database to connect to: nba_db. Hit enter to accept the default port, then hit enter again when it asks for the Username since we'll connect as the postgres superuser. You may be asked for a password; if so, enter it and connect. Your screen will look like this:

Connected to the nba_db via the SQL Shell
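
If you haven't used the SQL Shell before, the prompt sequence looks roughly like this, with defaults shown in brackets that you accept by hitting enter:

Server [localhost]:
Database [postgres]: nba_db
Port [5432]:
Username [postgres]:
Password for user postgres:

nba_db=#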

Now we just have to navigate to the script we just wrote and run it. To do this in a psql shell, just write \i and then the path to the file:
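For example, if create_tables.sql is sitting in the db_scripts folder we made earlier (adjust the path to match your machine):

\i C:/db_scripts/create_tables.sql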

Uploading data into our gl_player table

If the load works correctly, you'll see that the table was created and that 187,339 rows were copied. Let's check our database to make sure all the data is there! Go back to the pgAdmin browser and make sure nba_db is connected. Right-click on nba_db and hit refresh, then click down into Schemas > Tables and you should see gl_player listed! If you right-click gl_player > View/Edit Data > All Rows, you should get the following output:

All of our gl_player data is in our database!

All of our gl_player data is in our database. Now you can query the data, perform analyses, and create data-driven applications!
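
If you'd rather check from the shell than pgAdmin, a quick row count gives the same reassurance:

SELECT COUNT(*) FROM gl_player;  -- should return 187339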

What’s Next

We only have one table loaded so far, but I think the methodology is pretty clear. I won't walk through the script to create the remaining tables, but it's all available in the GitHub repo. Take the full script, update the file paths to suit your setup, and run it the same way as above. Or just write it out yourself so you understand your database!

Next post, we'll start manipulating the raw data to make it clean and easily usable. After that, we can start performing analyses and jump into machine learning. Looking forward to it!

As always, let me know if you have any comments, suggestions, questions, or requests! Also, smash that clap button below if you’ve learned something and let me know you’re here!
