If Your Data Is Messy — Use Shmessy!

Ohad Mata · Published in Pecan Tech Blog · Apr 4, 2024

Are you an analyst? BI/ETL developer? Data engineer?

You know it. You hate it. And yet, you use it on a daily basis.

I’m talking about the CSV format, of course. It’s the same format invented over 50 years ago, and it’s still used daily in our production systems.

It’s the same old dirty text format without any standardization, which makes the import/export experience problematic and unreliable.

Why am I telling you all this? Because I’m fed up and decided to do something about it.

I developed a simple open-source package based on five years of experience with these dirty files. The purpose of this package is to deal with all of the dirty CSV files that users upload to my system. For example, these include CSV files with problematic date formats, unescaped special characters, broken schema, and missing headers.

Before I explain what I did and how I did it, let’s understand what a CSV file is.

CSV files are text files that contain tabular information. CSV stands for Comma Separated Values, and usually, the information will be separated into columns by commas. Rows are separated by a line break character.

But that’s not all. To really turn a CSV file into a tabular object (from now on we’ll call it a DataFrame), we need to know several more parameters. We can request those parameters, as well as the data types of the columns, from the person who created the file. The problem is that not everyone is familiar with those parameters or the wide variety of types that exist.

The parameters that must be known in order to turn a CSV file into a DataFrame are listed below, followed by a short example in code:

  • Separator — The character that separates columns.
    Usually a comma, but could be a semicolon [;], pipe [|], or even colon [:]
  • Row terminator — The character that separates rows.
    Usually a line break will be used for row termination [\n or \r\n]
  • Header — If the first row represents the column names.
    Usually, the header is included as part of the file
  • Escape character — How we should escape special characters.
    For example, a comma that is used in the data itself. Usually a backslash [\]
  • Quote character — Used to wrap textual data. This parameter is required because textual data might contain special characters. Usually, the quote char is a double quote ["]. If we want to represent a quote character inside a string, we should escape it with the escape character.
  • Null value — How we should represent empty data.
    Usually an empty string, but it could also be the string “NULL” or “N/A”.
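
To make these parameters concrete, here is a minimal sketch of how they map onto pandas’ read_csv; the file name and the null markers are made up for illustration:

```python
import pandas as pd

# The parameters above, spelled out explicitly for pandas.
# "users.csv" and the null markers are hypothetical examples.
df = pd.read_csv(
    "users.csv",
    sep=",",                        # Separator
    lineterminator="\n",            # Row terminator
    header=0,                       # First row holds the column names
    escapechar="\\",                # Escape character
    quotechar='"',                  # Quote character
    na_values=["", "NULL", "N/A"],  # Strings treated as null values
)
```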

So far, we’ve just dealt with the parameters used to help understand the file’s structure. The columns’ data types are another problem entirely: no matter which type you are dealing with, you need some logic to identify it with high accuracy.

For example, numerical information can be written with or without a thousands separator or a decimal point [1,244 / 1244 / 1244.0].

In addition, in some cases, the information looks numerical, but in reality, it is not. Some systems even export dates as a UnixTimestamp, which looks like a purely numeric value (see the snippet after these examples):

  • ISO-8601 Format: 2024-03-23T17:31:55+00:00
  • Epoch UnixTimestamp: 1711215148
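
To see the ambiguity in code, here is a small illustration using the values above:

```python
from datetime import datetime, timezone

# "1,244" is clearly numeric to a human, but float("1,244") raises a
# ValueError; the separator must be stripped first.
print(float("1,244".replace(",", "")))  # 1244.0

# And the same digits can be a plain integer or a date in disguise:
epoch = 1711215148
print(datetime.fromtimestamp(epoch, tz=timezone.utc))  # 2024-03-23 17:32:28+00:00
```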

And if we are talking about dates or timestamps, that too can be quite a problem. There is no single standard for dates or timestamps used worldwide. I’ve often found myself trying to identify a date format out of hundreds of possible formats by attempting to parse samples with strptime.
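
A toy version of that guessing logic looks something like this; the candidate list here is hypothetical, and a real one contains hundreds of entries:

```python
from datetime import datetime

# A hypothetical shortlist of candidate formats (a real list is much longer).
CANDIDATES = ["%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y", "%Y-%m-%dT%H:%M:%S%z"]

def guess_date_format(samples):
    """Return the first format that parses every sample, or None."""
    for fmt in CANDIDATES:
        try:
            for s in samples:
                datetime.strptime(s, fmt)
            return fmt  # every sample parsed -> plausible format
        except ValueError:
            continue  # this format failed; try the next one
    return None

print(guess_date_format(["2024-03-23", "2024-04-04"]))  # %Y-%m-%d
```

Note that even this toy version is ambiguous: a value like 04/04/2024 matches both the day-first and the month-first format, and whichever candidate is tried first wins.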

And what about boolean information? Here, too, there are quite a few options: TRUE/FALSE, 0/1, YES/NO, T/F, etc.

You can see how this is a frustrating situation! So, how did I deal with the problem?

I developed the open-source package Shmessy.

If your data is messy, use Shmessy!

The package contains a function that accepts a CSV file and turns it into a Pandas DataFrame. The function reads selected parts of the file, guesses their formats, and casts the data to the guessed types.
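
In its simplest form, usage looks roughly like this (check the repository README for the exact, up-to-date API):

```python
from shmessy import Shmessy

# Read a messy CSV straight into a typed Pandas DataFrame.
# See the project README for the authoritative entry point.
df = Shmessy().read_csv("messy_file.csv")
print(df.dtypes)  # columns arrive already cast to the guessed types
```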

Let’s talk about the base logic of this project. When the main function is called, we should:

  1. Read the file and guess the basic fetch parameters. I am using Python’s built-in CSV sniffer plus some custom logic to improve it in some cases (a minimal sniffing example appears after this list).
  2. Read sample data from the file (by default, a random selection of rows) and guess the columns’ data types based on this sample.
  3. Convert each column to the guessed data type using the Pandas converter or the Shmessy converter. This part includes a pre-made data types interface to handle the variety of types (more details below).
  4. In case of casting exceptions for specific fields: raise an exception for the whole file, or let the user fall back, either setting the problematic values to NULL or converting the entire column to string.
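
Step 1 in miniature: the standard library’s csv.Sniffer can guess the dialect from a sample of the file. The file name below is a made-up example:

```python
import csv

# Guess the dialect (separator, quote char) and header presence from a sample.
with open("messy_file.csv", newline="") as f:
    sample = f.read(4096)

sniffer = csv.Sniffer()
dialect = sniffer.sniff(sample)
print(dialect.delimiter, dialect.quotechar, sniffer.has_header(sample))
```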

This mechanism uses plug-and-play data type handler classes. Each class should implement:

  • Validate — Return true if the column matches this data type. As mentioned, this method receives a sample of the entire column.
  • Cast_column — Cast an entire column to the given data type. This method uses the Pandas interface for the conversion, which should be much faster than converting the data row by row.
  • Cast_value — Cast a specific value to the given data type. I use this method for specific logic (such as falling back to NULL, or when it’s not possible to cast the entire column).

To identify the correct data type for a specific column, the process orders the types by weight (string has the highest weight, since all data can be represented as a string), then tries each type against the column’s sample in that order.

The first data type that says, “It looks like this column can be represented using me!” will be chosen as the column’s data type.
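
To make this concrete, here is a minimal sketch of such a handler and the weight-ordered guessing loop. The class and function names are my own illustration for this post, not the project’s actual internals:

```python
import pandas as pd

class BooleanHandler:
    weight = 1  # low weight: tried before more generic types

    def validate(self, sample: pd.Series) -> bool:
        # Accept the column only if every sampled value looks boolean.
        return bool(sample.dropna().isin(["TRUE", "FALSE", "YES", "NO"]).all())

    def cast_column(self, column: pd.Series) -> pd.Series:
        # Vectorized cast of the whole column (much faster than row by row).
        return column.isin(["TRUE", "YES"])

class StringHandler:
    weight = 100  # highest weight: any data can be represented as a string

    def validate(self, sample: pd.Series) -> bool:
        return True  # always matches, so it acts as the last resort

    def cast_column(self, column: pd.Series) -> pd.Series:
        return column.astype(str)

def guess_type(sample: pd.Series, handlers):
    # Try handlers from lowest to highest weight; the first match wins.
    for handler in sorted(handlers, key=lambda h: h.weight):
        if handler.validate(sample):
            return handler

handlers = [StringHandler(), BooleanHandler()]
chosen = guess_type(pd.Series(["YES", "NO", "YES"]), handlers)
print(type(chosen).__name__)  # BooleanHandler
```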

Why is Shmessy so powerful? As developers, we must always assume that the input data might be dirty. That’s why the whole package is designed to handle it. Specifically:

  • The fallback_to_string/fallback_to_null optional parameters are available in case you don’t want to raise an exception when reading data that does not match the guessed pattern (the first exception in the examples below); a short usage sketch follows these examples.
  • Unit tests are essential. Whenever I find an issue, it must be covered with a test. Right now, there are 60+ tests for various cases.
  • Virtual types are also supported, such as email and UnixTimestamp.
  • The exception mechanism is critical. I decided to make the exceptions as intuitive as I can. Let’s see some examples:
Error in line 117 for column "col_1": Couldn't cast value "2" to type Boolean[(1, 0)].
Error in line 111,002 for column "So": Couldn't cast value "So" to type Integer.
Error in line 46,866 for column "createdat": Couldn’t cast value "2023-11-01T02:38:32Z" to type Datetime[%Y-%m-%dT%H:%M:%S.%fZ].
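
And if you prefer the lenient behavior from the first bullet above, the fallback flag is passed when constructing the object. The keyword below matches the parameter name mentioned earlier; verify the exact signature against the repository:

```python
from shmessy import Shmessy

# Assumed keyword (named after the optional parameter described above):
# on a casting failure, the whole column falls back to string instead of
# raising an exception.
df = Shmessy(fallback_to_string=True).read_csv("messy_file.csv")
```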

I would love to hear any suggestions for improving the project, your opinion on it, and whether it has the potential to help you improve your day-to-day work as well.

We at Pecan AI have integrated the project into our operational system. When a problematic CSV file generates an error, the first thing I do is check if I have a bug in Shmessy that needs to be fixed, or if there’s really a problem with the file. This approach means I constantly improve the project and make using CSV files easier for all of us.

My hope is that this package will be popular, more and more people will use it, and we’ll improve the logic and the possible edge cases based on real-life examples.

Ohad
