Analyze Large CSV Files in Minutes With Docker Compose

Stephan Schulze · Published in Project A Insights · 3 min read · Mar 29, 2018


Imagine you have to analyze a massive CSV file. What do you do when Excel and similar tools have a nervous breakdown over the file size? With small files, investigating or comparing CSVs is a walk in the park. With huge files, it’s a pain, if not impossible.

Recently, I had to compare two CSV files with different information but a common identifier, each of them with ~500k entries.

My developer’s heart said: Shouldn’t I use a database?

My head said: Installing a full DB just for that use case?

I mean, I just wanted to check a few small things and then get rid of it again.

The solution — Docker

If you need to quickly spin up a small, temporary piece of infrastructure, Docker might be your friend. To get started, just pull the latest database image (Postgres in my case), and you are ready to go.
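If you want to grab the image right away, a plain pull is all it takes:

# fetch the latest official Postgres image from Docker Hub
docker pull postgres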

Part 1 — Docker Compose

Install Docker Compose.
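You can check that the installation worked with the standard version flag:

# verify Docker Compose is available
docker-compose --version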

Save the following snippet as docker-compose.yml.

# Use postgres/example user/password credentials
version: '3.1'

services:
  db:
    image: postgres
    container_name: postgres-test
    environment:
      POSTGRES_PASSWORD: example
    ports:
      - 15432:5432
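Before starting anything, you can let Compose validate the file; docker-compose config parses the YAML and prints the resolved configuration:

# validate and print the resolved compose file
docker-compose config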

Part 2 — Start the Container

Run docker-compose up in the folder containing the file.

On the first run, Docker downloads the Postgres image:

[Screenshot: Docker downloading the image]

Once the pull finishes, the database is up and running:

[Screenshot: Docker container started]
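If you prefer a terminal-only check, the same thing works detached; both commands below are standard Docker/Compose usage:

# start the stack in the background
docker-compose up -d

# confirm the postgres-test container is up
docker ps --filter name=postgres-test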

Part 3 — Access the Database and Load Your CSV

I can recommend DBeaver for that task.

Create a new connection to your dockerized database.
Important: use the host port defined in docker-compose.yml, 15432, which is mapped to Postgres’s default port 5432 inside the container.
The username is postgres.
The password is example (see docker-compose.yml).
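The same credentials work from the command line, assuming you have the Postgres client installed on the host:

# connect to the dockerized database via the mapped port
psql -h localhost -p 15432 -U postgres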

Part 4 — Import Your CSV Files

DBeaver offers a nice feature called “Flat files” connections.

Create one by pointing to the folder that contains your CSV files.

[Screenshot: new “Flat files” CSV connection]

In the DBeaver UI, you should now be able to see the connection and all CSV files listed under “Tables”. It should look similar to this:

[Screenshot: CSV files listed as tables in DBeaver]

To import these files into your Docker database (a pure-SQL alternative follows after this list):

  1. Create a new table within the “public” schema without any columns (make sure it is persisted!).
  2. Right-click on that table and choose “Import Table Data”.
  3. Select the “Flat files” connection and choose your file(s) from there.
  4. Adapt the mapping in the next step if necessary.
  5. Finish the import.
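If you’d rather skip the UI entirely, Postgres can ingest the CSV itself via psql’s \copy meta-command. A minimal sketch; the table orders and its columns are made up, so adapt them to your file’s header:

# pipe the DDL and the import into the dockerized database
PGPASSWORD=example psql -h localhost -p 15432 -U postgres <<'SQL'
-- hypothetical table; columns must match the CSV header
CREATE TABLE orders (
    id         text,
    created_at timestamp,
    amount     numeric
);
-- \copy reads the file on the client side, so no volume mount is needed
\copy orders FROM 'orders.csv' WITH (FORMAT csv, HEADER true)
SQL

Because \copy streams the file through the client, the CSV can stay on your machine; with plain COPY it would have to be visible inside the container.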

That’s it. Enjoy analyzing your data!
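As a parting example: the comparison from the beginning boils down to a single join on the shared identifier. A sketch with made-up table and column names (file_a, file_b, id):

-- rows present in the first file but missing from the second
SELECT a.id
FROM file_a AS a
LEFT JOIN file_b AS b ON b.id = a.id
WHERE b.id IS NULL;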
