Speeding up data import from SFTP into PostgreSQL in Rails

Recently, I was tasked with retrieving a CSV file from a remote Secure File Transfer Protocol (SFTP) server, extracting certain fields from the file (which contained at least 5,000 rows), and saving each row as a record in PostgreSQL. Here is a summary of how I pulled that off :)
Introduction
I broke down the task into the following chunks:
- Connect to SFTP
- Check for a file match (regex came in handy here)
- Download the file into an in-memory storage
- Parse the file data from in-memory into an array
- Destroy the existing locations database (more on this later)
- Iterate through the locations array saving each item as a record
- Reconnect to SFTP and archive the processed file
Connecting to SFTP
SFTP, known as SSH File Transfer Protocol or Secure File Transfer Protocol, is a secure way of transferring files between two remote systems. The difference between SFTP and FTP (File Transfer Protocol) is that SFTP runs on top of SSH, so files are transferred over an encrypted connection.
To connect to an SFTP server, Ruby has the net-sftp and net-ssh gems. Add them to your Gemfile and run bundle install:
gem 'net-sftp'
gem 'net-ssh'
I had a helper method for starting a standalone SFTP session. It starts an SSH session using the given arguments and then opens a new SFTP session on top of it, blocking until the SFTP session is fully open and initialized before returning it. Check this documentation for details.
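A minimal sketch of such a helper, assuming the host and credentials come from environment variables (the with_sftp name and ENV keys are mine, not the original code):

require 'net/sftp'

# Hypothetical helper: opens a standalone SFTP session and yields it to the block.
# Net::SFTP.start blocks until the session is fully open and initialized,
# and closes the connection when the block returns.
def with_sftp
  Net::SFTP.start(ENV['SFTP_HOST'], ENV['SFTP_USER'], password: ENV['SFTP_PASSWORD']) do |sftp|
    yield sftp
  end
end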
With that in place, we can iterate over the directory we need to check for our file.
Check for a file match
The regex I used to match the filename is redacted here for security reasons, but the matching logic looked something like this.
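A sketch of that check using net-sftp's directory listing (the pattern below is a placeholder, not the real one):

# Placeholder pattern -- the real filename regex is redacted.
FILENAME_PATTERN = /\Alocations_.*\.csv\z/

def find_matching_file(sftp, remote_dir)
  sftp.dir.foreach(remote_dir) do |entry|
    return File.join(remote_dir, entry.name) if entry.name =~ FILENAME_PATTERN
  end
  nil
end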
Download the file into an in-memory storage
I decided against writing the data to disk and instead passed an in-memory IO object as the destination. The :read_size option sets the maximum number of bytes to read at a time from the source; it defaults to 32,000 bytes.
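In code, that looks roughly like this (remote_path stands in for the matched file from the previous step):

require 'stringio'

io = StringIO.new
# download! accepts any object that responds to #write as the destination,
# so nothing touches the disk. :read_size caps how many bytes are read per chunk.
sftp.download!(remote_path, io, read_size: 32_000)
csv_data = io.string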
Parse the IO string and save the data into an array of locations
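The parsing itself can be done with Ruby's standard CSV library; something along these lines, with hypothetical column names standing in for the real fields:

require 'csv'

# Keep only the fields we care about from each row (column names are illustrative).
locations = CSV.parse(csv_data, headers: true).map do |row|
  {
    name:    row['name'],
    address: row['address'],
    city:    row['city']
  }
end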
Refresh/recreate the locations database before creating new records
The locations table was meant to be recreated every day, so the first step was to clear out the existing records using Rails' destroy_all method:
Location.destroy_all
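Note that destroy_all instantiates every record and runs callbacks; if the Location model has no callbacks or dependent associations to honour, delete_all issues a single SQL DELETE and is noticeably faster.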
Iterate over the locations and create new DB records
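The first version looped over the parsed rows and created each record individually, roughly along these lines (a sketch, not the original code):

# One INSERT statement per row -- simple, but slow for thousands of records.
locations.each do |attrs|
  Location.create!(attrs)
end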
ActiveRecord Import gem
Later, I realized I needed to optimize for speed and reduce the time it took to import these locations into the database. Luckily, I found the activerecord-import gem, which handles bulk insertion of data into a database using ActiveRecord. So I changed the above code:
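Roughly, the bulk-insert version looks like this (a sketch; the column names are illustrative):

columns = [:name, :address, :city]
values  = locations.map { |attrs| attrs.values_at(:name, :address, :city) }

# A single multi-row INSERT instead of thousands of individual ones.
Location.import(columns, values, validate: true)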
In addition, add the following activerecord-import configuration to the Location model:
require 'activerecord-import/base'
require 'activerecord-import/active_record/adapters/postgresql_adapter' # I'm going to use PostgreSQL, hence the 'postgresql_adapter'
Putting it all together
For brevity, check out the full code in the lib/locations/downloader.rb file. I converted this into a Rake task; if you are new to Rake tasks, check this blog for more information.
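The task itself can be as small as this (the namespace, task name, and Locations::Downloader entry point are assumptions based on the file path above):

# lib/tasks/locations.rake
namespace :locations do
  desc 'Download the latest locations CSV from SFTP and import it'
  task import: :environment do
    Locations::Downloader.new.call
  end
end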
Clone this repo and run a few tests with actual SFTP credentials, or customize the code to process a huge CSV file from anywhere. The speed of the import will dazzle you.
The archive logic can be found in the lib/locations/archiver.rb file.
It’s my hope that reading this blog has given you the tips and confidence required to deal with the menace of importing huge CSV datasets into your database.
Thanks for your time!