Diagnose and edit your data files with CSV Studio
CSV Studio is a visual debugger and remote file editor for data pipelines
Anyone who has imported a CSV file has been through the arduous process of fixing errors, one by one until the import is successful. We created CSV Studio to make it faster and easier for anyone to find and edit errors when they import CSV files. It is a tool with a browser based interface that can visualize large files stored locally or in a remote server. It offers a robust parser and automated error correction for non-conforming and badly formatted CSV files.
If you publish data, CSV Studio allows you to save valuable time making revisions and correcting already published files by testing your data before publishing. And if you consume data, CSV Studio helps you avoid unexpected complications by using our parser and visual debugger to check your data at the point of entry.
This post outlines two scenarios: a typical CSV import workflow using a standard command line tool, and the same process again, streamlined and accelerated using CSV Studio.
Problems you encounter with the traditional CSV import workflow
A typical CSV import workflow involves fixing errors in the CSV file one by one until the import is successful. If we are working on a remote data server, with terminal access, we start with a command line tool such as csvkit to move the data into a database (see Figure 1).
Importing a new file almost never works on the first attempt, and indeed if we try an import, we see our first error: the file is not UTF-8 encoded. To proceed with the import we need to specify the encoding.
To determine the encoding we can use the command line utility “file”. It will analyze the file and return the most likely encoding (see Figure 2). In this case ISO-8859–1, a common default for database files.
We can try the import again with the ISO-8859–1 encoding. In Figure 3, we can see that csvkit hits an error on line 124251: that line has an extra column 43 instead of the expected 42. We must correct this error in order to proceed with our import. Having a CSV file of very large size and with syntax errors severely limits our choice of tool to view and edit the CSV file.
As we can see in Figure 4, csvkit utilities are not going to let us browse such a large and complex file until we correct the error. But since the data file is only 40MB, a lightweight editor like vi will work fine.
The error is shown in Figure 5, in the middle of the vi window. The sequence:
“H”, \N, “22202”, “HARLINGEN MEDICAL CENTER”
“H”, “22202”, “HARLINGEN MEDICAL CENTER”
This perfectly matches the pattern 2 lines above, which has no errors:
“H”, “1918”, “EVANS PHARMACY”
So we can correct this error by simply removing the extra \N and comma. If we try the csvkit import command again, it will work: our third attempt is a SUCCESS! Not bad for a large and complex data file, but as you can imagine this process can quickly become labor intensive, not to mention unreliable. CSV Studio can do a lot to simplify the diagnosis and error correction process.
Use CSV Studio to test and diagnose your CSV files before publishing or using data.
Now we try a different approach, we will use CSV Studio before the import to double check the file for errors.
In the left pane of CSV Studio we can see at a glance all the potential issues in the file. In addition to the issue on line 124252, which we can see easily correct with CSV Studio (as explained here), we see a second issue that went unreported by csvkit. This issue is very minor, however if it’s not corrected it could mean a broken data pipeline downstream. In fact whereas csvkit did not find any issue on line 15993, other parsers can and will reject that line.
This is because a lot of CSV Parsers try to be as lenient as possible to tolerate small variances in CSV syntax — sometimes this removes errors (as was the case with csvkit) — and sometimes not.
In Figure 7, CSV Studio shows us the raw content of line 15993 extracted directly from the unparsed CSV file. We see immediately that the potential issue there is a pair of double quotes escaped with a backslash character (\). Using backslash as an escape is acceptable for many CSV parsers including CSV Studio. However CSV Studio tells us that in this file, the backslash is not used as an escape character anywhere else. So it is a good idea to direct CSV Studio to correct this line and make the whole file RFC-4180 compliant by using the appropriate escape mechanism(*).
(*) Note: Given the nature of the error, the astute reader will suspect that there is more than one way to skin a cat. Indeed, one possibility offered both by CSV Studio and most command line tools is to try and alter the CSV grammar and see if this solves the parsing error. This creates a different set of issues, a topic for a future white paper.
With CSV Studio we can visually examine all potential issues and choose how to correct them. We can trust that the resulting CSV file is RFC 4180 compliant and contains no formatting approximations or hidden errors.