Just load this CSV file into BigQuery, they said…

tim.ellissmith
Appsbroker CTS Google Cloud Tech Blog
6 min read · Jul 12, 2022
Some of the issues you may encounter with CSVs

Imagine that, as a data engineer, you've been asked to load a CSV file into BigQuery. This can be simple.

bq load "my_dataset.my_table" source_file.csv

There it is! Simple.

But wait! There’s more!

What if the CSV file has errors?

That can also be simple.

bq load --max_bad_records=10 "my_dataset.my_table" source_file.csv

This works, but it leaves us with a few questions:

  • How many errors should we allow?
  • How do we report on errors?
  • Do we just ignore those bad entries?
  • How do we handle the bad file? Do we leave it in the same bucket?

This becomes more difficult if the size of the CSV is unknown. Allowing 10 errors in a file of 10 lines is very different from allowing 10 errors in a file of 10 million lines (100% vs 0.0001%).

Perhaps we want to be permissive, capture all the bad lines and then report on them? This gives rise to further questions: how will we capture them, and where will we report them?
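To make this concrete, here is a minimal sketch using the Python BigQuery client (the table name, bucket URI and 1% threshold are illustrative assumptions, not recommendations): we allow bad records on the load job, then inspect the errors the job collected so we can report on them, and fail if the error ratio is too high.

from google.cloud import bigquery

# Hypothetical names for illustration only.
TABLE_ID = "my_project.my_dataset.my_table"
SOURCE_URI = "gs://my-bucket/source_file.csv"

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # assumes a header row
    autodetect=True,
    max_bad_records=1000,  # be permissive, then decide afterwards
)

load_job = client.load_table_from_uri(SOURCE_URI, TABLE_ID, job_config=job_config)
load_job.result()  # wait for the load to finish

# Report on what was skipped rather than silently ignoring it.
bad = load_job.errors or []          # errors reported for the skipped rows
loaded = load_job.output_rows or 0
total = loaded + len(bad)
error_ratio = len(bad) / total if total else 0.0

for err in bad:
    print(f"Bad record: {err.get('message')}")

if error_ratio > 0.01:  # the threshold itself is a business decision
    raise RuntimeError(f"Too many bad rows: {error_ratio:.2%}")

Whether the skipped rows are then re-requested, quarantined or simply logged is, again, a business decision rather than a technical one.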

Okay, let’s assume we’ve gone with the simple solution and that the CSV is expected to be correct, so we don’t need to worry about errors.

What happens if we re-load the data?

Following on from the above question: If the file does have errors, and is corrected and then reloaded, how do we handle duplicates?

If the load file constitutes a table in itself, we can run:

bq load --replace "my_dataset.my_table" source_file.csv

But often we are appending to an existing table. In this case do we:

  • Purge the file data in the database before loading?
  • Deduplicate as an interim step before loading to our master tables?
  • Merge the data into the database?

Whichever option we choose, how do we co-ordinate it? This may require an orchestration tool.
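Taking the merge option from the list above as an example, one common pattern is to load the CSV into a staging table first and then MERGE it into the main table on a key. A minimal sketch (the project, table and column names, and the id key, are assumptions for illustration):

from google.cloud import bigquery

client = bigquery.Client()

merge_sql = """
MERGE `my_project.my_dataset.my_table` AS target
USING `my_project.my_dataset.my_table_staging` AS source
ON target.id = source.id
WHEN MATCHED THEN
  UPDATE SET target.name = source.name, target.address = source.address
WHEN NOT MATCHED THEN
  INSERT (id, name, address) VALUES (source.id, source.name, source.address)
"""

client.query(merge_sql).result()  # run the MERGE and wait for completion

Something still has to decide when the staging load has finished and the MERGE can run, which is exactly the kind of coordination an orchestration tool gives us.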

But the CSV file is a non-standard CSV!

Let’s look at some of the non-standard aspects of a CSV:

  • A different delimiter: Delimiters are often things like semi-colons or pipes. The BQ API can generally handle these.
  • Quote characters: A well-formatted CSV should have a quote character in case the delimiter appears in text.
  • A quote character appearing in the middle of text: Let’s say that the quote character is ‘. Then this text could cause confusion:
'She said: It's mine'
  • Newline characters in CSV files: These are generally okay inside a quoted field:
Column1,Column2,'Column\n3'

But it can cause issues if it’s not:

Column1,Column2,Column\n3
  • A blank column in the middle of a CSV:
Name, Address,,Comments
Tim, ZZ1 AA23,,A new customer

This can pose particular problems when matching the schema.

  • Jagged edges:
Name, Address, Comments
Tim, ZZ1 AA23
Clare, YY2 BB11,This has a comment
  • Extra columns not defined in the expected schema

Here our expected schema is:

Name, Address, Phone

But the CSV file has:

Name, Address, Phone, Comments
Tim, ZZ1 AA23, 1111 111111,A new Customer

While each of these issues can be handled, if you don’t know about them before you load the data then you might find yourself in trouble later.
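Several of these quirks can at least be declared to the load job up front. A sketch of the relevant CSV options on the BigQuery load API (the paths and table name are illustrative; note that none of these options rescue an unescaped quote character in the middle of a field):

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    field_delimiter="|",          # non-comma delimiter
    quote_character="'",          # non-standard quote character
    allow_quoted_newlines=True,   # newlines inside quoted fields
    allow_jagged_rows=True,       # missing trailing columns become NULLs
    ignore_unknown_values=True,   # extra trailing columns are dropped
    autodetect=True,
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/source_file.csv",   # hypothetical path
    "my_project.my_dataset.my_table",   # hypothetical table
    job_config=job_config,
)
load_job.result()

The bq CLI exposes the same options as flags such as --field_delimiter, --quote, --allow_quoted_newlines and --allow_jagged_rows.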

Dreaded Dates

The BigQuery API only accepts dates in the format “YYYY-MM-DD”, but a CSV could contain many different date formats, e.g. 11-July-2022. Do you need to transform these dates into an acceptable form? If so, how are you going to do it?

Of course, dates are only one thing to take into account. We may need to transform other fields such as times, or even convert integers: for example, if a column is quoted in 000s, we may need to transform 100 into 100 000.

Here there is no simple solution with the BQ API, so you may need to write some transformation code. Be careful where you write it, though, because of the next question.
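As an illustration of what that transformation code could look like before the load (the file paths, the candidate date formats and the column position are all assumptions for this sketch):

import csv
from datetime import datetime

DATE_FORMATS = ["%d-%B-%Y", "%d/%m/%Y", "%d %b %Y"]  # e.g. 11-July-2022

def normalise_date(value: str) -> str:
    """Try each known format and return YYYY-MM-DD."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognised date: {value!r}")

with open("source_file.csv", newline="") as src, \
     open("transformed.csv", "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    writer.writerow(next(reader))        # copy the header row unchanged
    for row in reader:
        row[2] = normalise_date(row[2])  # assume the date is the third column
        writer.writerow(row)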

How big is a file?

So, we know how to handle the above errors and have written a Cloud Function to deal with them. It triggers when a file arrives in the bucket, does the transformation and loads the file into BigQuery.
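As a rough sketch of that shape (assuming a storage-triggered function using the Functions Framework; the table name is illustrative, and the transformation step is only marked with a comment):

import functions_framework
from google.cloud import bigquery

@functions_framework.cloud_event
def load_csv(cloud_event):
    """Triggered when a file lands in the bucket; loads it into BigQuery."""
    data = cloud_event.data
    uri = f"gs://{data['bucket']}/{data['name']}"

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
    )
    # In practice the transformation would download the object, rewrite it and
    # load the result (e.g. via load_table_from_file); here we load it as-is.
    client.load_table_from_uri(
        uri, "my_project.my_dataset.my_table", job_config=job_config
    ).result()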

We load our test file of 100 lines, and it loads in a couple of seconds. We’ve written unit tests, deployed it to our dev environment, and are ready to test it against real data.

The first file arrives and it’s 20 GB. We rapidly increase the memory and the timeout of our Cloud Function to the maximum, but we still face a timeout.

What do we do now? We are already tight on deadline, but at this point we will need to re-write the pipeline using different tooling, potentially leading to delays in delivery.

While most data engineers will know of tools to do this, we may have avoided using them due to, amongst other things: (1) lack of knowledge, (2) infrastructure requirements, or (3) tight deadlines leaving no time for proper scoping.

Let’s add some row metadata

Now we find that, in addition to all the above requirements, we need to include some row metadata. This could include:

  • The filename that this row was extracted from
  • The line number of the row in the file
  • The date the data was extracted from the source
  • The date (and potentially time) loaded into BQ

Now we have questions about where we get this data from (in the case of data like the extract date), how we pass it to the correct function, and how we add it.

Note that if we want the line number of the row and it isn’t already in the data, this rules out using the bq load command before adding it, as the BQ API makes no promises about load order.
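A sketch of adding that metadata during pre-processing, before the load (the paths and column names are assumptions, and the extract date is left out because, as noted above, where it comes from is itself an open question):

import csv
from datetime import datetime, timezone

SOURCE = "source_file.csv"       # hypothetical paths
TARGET = "with_metadata.csv"

load_ts = datetime.now(timezone.utc).isoformat()

with open(SOURCE, newline="") as src, open(TARGET, "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)

    header = next(reader)
    writer.writerow(header + ["source_file", "line_number", "load_timestamp"])

    # Start at 2 because line 1 of the source file is the header.
    for line_number, row in enumerate(reader, start=2):
        writer.writerow(row + [SOURCE, line_number, load_ts])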

We’ve loaded the file. Now what?

Let’s go back and assume we can load the file simply via a Cloud Function calling the BigQuery API. The data lands in BigQuery. What do we want to do from here?

  • Rename columns according to our business logic?
  • Profile the data to get summary statistics (one small sketch follows this list)?
  • Check for outliers in the data and report on them?
  • Check for Personally Identifiable Information (PII) in the data?
  • Join the data with other data or model the data?
  • Tag the data with appropriate information in Data Catalog?
  • Make the data available, limiting access as appropriate?
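Taking the profiling question as one example (a sketch only, with hypothetical table and column names), even a simple query gives us row counts, null counts and distinct counts to sanity-check the load:

from google.cloud import bigquery

client = bigquery.Client()

profile_sql = """
SELECT
  COUNT(*)                       AS row_count,
  COUNTIF(name IS NULL)          AS null_names,
  APPROX_COUNT_DISTINCT(address) AS distinct_addresses
FROM `my_project.my_dataset.my_table`
"""

for row in client.query(profile_sql).result():
    print(dict(row.items()))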

Conclusion

While loading a simple CSV file seems like one of the simplest things a data engineer can do, in reality there are a lot of complexities involved in even the simplest tasks. If this is part of an automated pipeline which needs to be productionised, any mistakes can lead to costly delays and refactoring.

Understanding the requirements upfront and planning accordingly can reduce these risks.

It is also important to understand where these CSV files come from and how much room we have to push back and ask for properly formatted files.

This is actually one of my favourite interview questions. I would expect a good data engineer to anticipate these problems and show an understanding of the potential issues.

In my next blog, I will cover a checklist of questions to ask when you create a data pipeline, and then I will talk about how to solve some of these problems as well as looking at what a complete pipeline may look like.

About CTS

CTS is the largest dedicated Google Cloud practice in Europe and one of the world’s leading Google Cloud experts, winning 2020 Google Partner of the Year Awards for both Workspace and GCP.

We offer a unique full-stack Google Cloud solution for businesses, encompassing cloud migration and infrastructure modernisation. Our data practice focuses on analysis and visualisation, providing industry-specific solutions for Retail, Financial Services, and Media and Entertainment.

We’re building talented teams ready to change the world using Google technologies. So if you’re passionate, curious and keen to get stuck in — take a look at our Careers Page and join us for the ride!
