Loading data to PostgreSQL

Dimitris Dovinos
codefully.io
Apr 1, 2019


Loading data into an existing database is part of an application's lifecycle, whether during the original setup or later maintenance. At codefully.io we deal with a variety of data types and sizes. We work with Postgres, and we employ different techniques and tools depending on the size, quality, and type of the data.

Tools

  1. Text editors. I use vim and Sublime Text. Atom seems to struggle with large datasets (a few thousand lines). I have not tested Visual Studio yet.
  2. The psql command line. I rarely use anything more complex than psql. There are plenty of clients (DbVisualizer, Navicat, etc.), but they all seem to add bulk to the process and rarely provide more information than the psql console.

File Organization

Every change we make to a database, no matter the size (a few kB or a few GB), is kept under version control. We manage several databases, and we have a git repository for each one where we store every change. Version control is not that important in itself here, since SQL scripts tend to be run and forgotten, as opposed to code that needs maintenance. The real advantage of a version control system like git (or occasionally mercurial) is being able to store the migrations and share them with other developers. Each migration lives in a separate folder with a date and a brief description. The folder contains files, each of which is usually a migration of a single table or of a particular area of the data.

File system of database migrations
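As an illustration, a layout along these lines works well (the folder and file names below are made up for this post, not taken from a real repository):

    db-migrations/
      2019-02-11_add_sales_channels/
        01_create_channels.sql
        02_backfill_channels.sql
      2019-03-28_load_q1_sales/
        01_load_raw_sales.sql
        02_cleanup_customer_names.sql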

Having a system for the migrations can prove to be a lifesaver. Keeping track of what you have already done will help you debug issues down the road. Hopefully, you will never have to revisit old migrations but it is not unheard of!

The data is too big!!

Data size is a matter of perspective. For an application that holds blog entries, you may consider a few thousand entries to be too many. A few thousand rows may be a minute's worth of data for an application that tracks sales of small items. I consider any file that you cannot visually inspect, and therefore cannot make manual changes to, to be too big. The techniques and tools are similar whether you deal with one hundred rows or 100 million rows:

Lots of memory and fast disk access. Our setups (Linux) have at least 16 GB of RAM and fairly fast solid-state drives. I consider this a minimum requirement for any development environment.

You need to select the right editor. Sublime is great for handling a few hundred thousand rows (50–100 MB). The editor will let you do some basic operations on the file: edit the top or bottom and run a quick visual inspection in case something really sticks out (for example, a large gap in the data). Most modern editors support multiple cursors, which is a very useful tool for manipulating a text file. Beyond that, you need to look into vim.

Search and replace operations are best done with vim and regular expressions. I find regex a pain but I have to admit that there is nothing that comes even close for dealing with big files and complex changes. My favorite examples:

  1. Replace consecutive tabs with nulls.
  2. Replace a tab at the end of a line with a null.

These two examples are lifesavers. Hit the colon key in vim and you will be in command-line mode. You can then enter the first one (%s/\t\t/\t\\N\t/g) to replace consecutive tabs with nulls, or the second (%s/\t$/\t\\N/g) to add a null when a line ends in a tab.

Replacing millions of entries is not a problem with vim. The number of changes you need to make depends on how well formed your initial data is. Usually, you will need to replace empty entries with nulls ( \N ), or perhaps remove some special characters that Postgres does not like.

The data is (always) badly formed!

The data has come from the sales department (or some other area that does not care about you!). The sales people are, of course, oblivious to your table structure, and you realize that in order to get to something close to your schema you will need lots and lots of changes. For example:

  1. Manipulation and concatenation/splitting of strings: the input data has the first and the last name joined together but you store them separately! (we will analyze this one further down)
  2. Calculations based on the input data: For example, you need to know how much each item costs in a multi-pack, but the input data includes the total price and the number of items in the pack.
  3. References to other tables: For example, each row references a user but the input data includes just the user's email (not the foreign key).

All these are common problems that involve at least an extra processing step. Luckily you have the best tool available for the job: Postgres itself!

Postgres offers temporary tables which are “automatically dropped at the end of the session”. They are great for doing work that should not be a permanent part of the database. Imagine that you have a sales table which includes the first and last name of the customer as well as the total expenditure. The accounting department has sent you the following data, and your job is to enter it into your database.
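The raw file was shown as an image in the original post and is not reproduced here. A made-up CSV in the same spirit, with the customer's full name and the total in each row, might look like this:

    John Smith,120.50
    Mary Jones,75.00
    Anna Karenina,220.10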

Your table looks like (enter \d sales in psql to get a description):

Target table.
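The screenshot of the \d output is missing here; a sales table along these lines matches the description (the column names and types are assumptions, not the original schema):

    CREATE TABLE sales (
        id                  serial PRIMARY KEY,
        customer_first_name text,
        customer_last_name  text,
        total               numeric(10,2)
    );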

Your original data have the first and the last name in a single string, while you need them separated. Create a temporary table that holds the data that is to be loaded. Base the new table on the existing table, but also include some extra columns for data that needs to be further processed.
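The original CREATE statement is not shown; a sketch along these lines fits the description. It spells the columns out instead of copying the sales definition, so the raw customer column and the processing columns are easy to see (all names are assumptions):

    -- Temporary table: automatically dropped at the end of the session.
    CREATE TEMPORARY TABLE t_data (
        customer            text,            -- raw "First Last" string as delivered
        total               numeric(10,2),
        customer_first_name text,            -- to be filled in from customer
        customer_last_name  text             -- to be filled in from customer
    );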

The t_data table will serve as an intermediate table for loading data, processing it, copying it to the actual database table and then throwing it away.

Notice that we have added a couple of extra columns (customer_first_name and customer_last_name) which we will populate with data from the customer column.

First, we need to load the data into the table. We can use the following SQL script.
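The script itself was embedded in the original post; a psql \copy along these lines does the job (the file name is a placeholder):

    -- Load the raw CSV into the temp table; only the columns present in the file.
    \copy t_data (customer, total) FROM 'sales_data.csv' WITH (FORMAT csv)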

I usually prefer to use tabs as a delimiter (which is the default), but for the purpose of this post, it is easier to use commas. Once the script has run, you will have 3 rows in the temp table. There will still be no first and last name data, so we need to extract this from the existing data.

Next, we need to set the first name:
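The original statement is not shown; with the assumed t_data columns above, split_part does the trick:

    -- Everything before the first space becomes the first name.
    UPDATE t_data
    SET customer_first_name = split_part(customer, ' ', 1);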

And similarly the last name:
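Again a sketch, using the same assumed columns:

    -- The second space-separated token becomes the last name.
    -- (Naive: a three-part name would lose its final part.)
    UPDATE t_data
    SET customer_last_name = split_part(customer, ' ', 2);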

Now the table looks closer to the finished product:
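A quick SELECT shows the intermediate state; the rows below follow from the made-up sample data above, not from the original post:

    SELECT customer, total, customer_first_name, customer_last_name FROM t_data;

     customer      | total  | customer_first_name | customer_last_name
    ---------------+--------+---------------------+--------------------
     John Smith    | 120.50 | John                | Smith
     Mary Jones    |  75.00 | Mary                | Jones
     Anna Karenina | 220.10 | Anna                | Karenina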

There is one last step for entering the data into the sales table:
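A final INSERT ... SELECT moves the cleaned-up rows across; after the session ends, t_data disappears on its own:

    -- Copy only the columns that exist in the real table.
    INSERT INTO sales (customer_first_name, customer_last_name, total)
    SELECT customer_first_name, customer_last_name, total
    FROM t_data;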

Clearly, this was too small a dataset to bother with a temporary table and multiple SQL scripts. The approach, though, is sound, and it scales easily to millions of rows.

Hope this helps, and please provide feedback if you would like us to cover other Postgres issues.
