Query CSV Directly with DuckDB

Tom Klimovski
3 min read · Dec 29, 2022


duckdb and csv

You’ve been tasked with one of the following:
— load a new CSV file into BigQuery for analysis
— add a new file to your data pipeline
— do some light-touch data quality checks on a vendor’s file-drop

You reach for the closest, most familiar tooling. Usually that means loading it all up in Excel, right? Which gets you going, until the file turns out to be 600MB and over 1.2M lines long, and Excel tells you it can only load it partially.

Ok, so what next?

You might be on GCP/BigQuery, so you do the clicky-and-droppy, but BQ tells you that position 1023845 is no good. How do you even find that byte position in a file? <sigh>

You search Stack Overflow, and it tells you to use pandas. That gets you all the way there, but your velocity is impeded because you’ve forgotten every command. Perhaps you ask ChatGPT, but you need answers right away.

Wouldn’t it be cool to just `select * from output.csv`?

Well you totally can.

Step 1

brew install duckdb

Step 2

> duckdb
D> select * from 'output.csv';

So instead of a table name, we can refer to our CSV. Neat!

You can even glob with `data/*.csv`, or use other formats like `data/*.parquet`.
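
For example, assuming a hypothetical `data/` directory holding a set of daily file-drops, the same trick works across all of them at once:

D> select count(*) from 'data/*.csv';
D> select * from 'data/*.parquet' limit 10;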

Troubleshooting

So the CSV load isn’t going so well. You’ve fired up duckdb, and it says:


Error: Invalid Input Error: Could not convert string 'Column_1' to INT32
in column "Column_1", at line 100002.

Progress! You’ve identified some bad lines in your supplied CSV, and you’ve got a line number too. You could probably do some terminal `head` and `tail` commands, but what would be better is to be able to count, group and find all the bad lines in the file.

The first thing we should do is create a view on top of your CSV with some line numbers, so you can start verifying the file and doing some light-touch quality checks.


D> create sequence seq_id start 1;

D> create or replace view test_1 as select nextval('seq_id') as seq_id, *
   from read_csv('output.csv', ALL_VARCHAR=1, AUTO_DETECT=TRUE);

A couple of things here; for reference, see https://duckdb.org/docs/data/csv


  • You’ve got some bad data. DuckDB samples the file and guesses each column’s type, and some of the data in your column isn’t what it’s supposed to be. `ALL_VARCHAR` lets you load everything as text, regardless of type, so you can analyse it.
    — `AUTO_DETECT` tells DuckDB to work out the column names and count for itself. Otherwise we would have to type them out manually via the `columns={...}` parameter, as in the sketch below.
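
If the auto-detection guesses wrong, you can spell the schema out yourself. A minimal sketch, assuming two hypothetical all-VARCHAR columns:

D> select * from read_csv('output.csv',
     header=true,
     columns={'Column_1': 'VARCHAR', 'Column_2': 'VARCHAR'});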

OK, with that out of the way, let’s see what’s so bad about that line number.

At this point, since our sequence increments every time the view is queried, it’s a good time to lock it down and materialize the view with our line numbers.


create table test_1_mv as (select * from test_1);

Now we’re locked in and we can query test_1_mv and drop things and manipulate it all day long.
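
For instance, you can pull the rows around the reported error position and eyeball the context. A sketch using the line number from the earlier error:

select * from test_1_mv where seq_id between 99999 and 100003;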


-- it seems that line numbers in errors are zero-indexed
-- however, you cannot start a sequence from 0
-- so we need to subtract 1 from the id
select * from test_1_mv where seq_id = 100001;
┌────────┬───────────────┬───────────────┐
│ seq_id │   Column_1    │   Column_2    │
│ int64  │    varchar    │    varchar    │
├────────┼───────────────┼───────────────┤
│ 100001 │ Column_1_text │ Column_2_text │
├────────┴───────────────┴───────────────┤
│ 1 rows                                 │
└─────────────────────────────────────────┘

Apparently the file I’m looking at has the header duplicated across multiple lines, so there’s text where there should be numbers. Nice.

Let’s find all the offending rows. My first thought is to look for all rows that have `Column_1_text` in them.


select * from test_1_mv where Column_1 = 'Column_1_text';

This yields 11 rows. Another way would be to look for a column we expect to be a number, yet contains letters or alphanumeric data.


select * from test_1_mv
where regexp_matches(Column_1, '[a-zA-Z]');

This yielded 15 rows, so we found a few extra bad rows! Let’s remove them:


delete from test_1_mv where seq_id in (select seq_id from test_1_mv
where regexp_matches(Column_1, '[a-zA-Z]'));

This should have everything cleaned up now.
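
To double-check, re-run the regex filter against the table and, assuming Column_1 is meant to hold integers, try casting it; `try_cast` returns NULL instead of erroring, so both counts should now be zero:

select count(*) from test_1_mv where regexp_matches(Column_1, '[a-zA-Z]');
select count(*) from test_1_mv where try_cast(Column_1 as INTEGER) is null;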

Writing out the new file — Export

To export the data from a table to a CSV file, use the `COPY` statement.


COPY test_1_mv TO 'output_cleaned.csv' (HEADER, DELIMITER ',');
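
If you’d rather not ship the helper seq_id column, or you want Parquet instead of CSV, `COPY` also accepts a query and a `FORMAT` option. A sketch using DuckDB’s `EXCLUDE` syntax:

COPY (select * exclude (seq_id) from test_1_mv)
TO 'output_cleaned.parquet' (FORMAT PARQUET);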

FIN


Tom Klimovski

Engineer with a strong focus on GCP. Love a great API and an even better IPA.