5 most common parsing errors in CSV files (and how CSV Studio can help)
Importing a CSV file can be frustrating. We’ve all struggled with importing and re-importing a file that still contains pesky, difficult-to-identify issues. While CSV import errors can vary widely, we’ve noticed that people are likely to encounter the same handful of CSV parsing errors over and over again. If you learn how to identify and address these errors, importing data can take considerably less time and energy… things we’d all like to have more of.
We’re here to help, both with the five tips below, and with our remote browser for CSV files in the cloud. CSV Studio offers a robust parser and automated error correction for non-conforming and badly formatted CSV files. It enables you to remotely browse files on a data server, automatically identify and correct errors, and export the file to a database free from parsing errors. Watch the demos to learn more.
#1 — Unrecognized Unicode / Illegal Codepoint
Even if this is not the most prevalent issue, this is definitely the first issue you can encounter when dealing with CSV. If you use python to process data, this issue will show up very quickly since python I/O will throw an exception at the first sign of trouble. Many parsers cannot display or process a text field with an illegal codepoint and you are forced to immediately find the correct encoding before attempting any further diagnostic.
A utility like the ‘file -e’ command may work if there are enough codepoints to work with. Be aware that even though most of the internet and personal computers take a default UTF-8 encoding for granted, ISO-8859 Latin-1 is as good a bet as any for data originating from a database.
#2 — Text field with an unescaped delimiter
If the column separator appears unescaped in a text field, this will cause the line to have an extra column. Typically the problem will appear when the CSV file is not using double quotes to enclose text and number fields. Names and addresses are often the result of keyboard entry which means that they can contain all kinds of control characters: /, \ , |, ^, left and right arrows, carriage return, linefeeds, etc…
The rational fix is to re-export the file using double-quotes to enclose columns. If this is not possible and it becomes necessary to remove the extra separators, using CSV Studio to escape the extra separators may help you keep your sanity.
#3 — Quoted string with an unescaped double quote
Many CSV files take the sensible precaution of enclosing all text inside quoted strings. An unescaped double-quote becomes a potential issue.
The accidental use of DJ”S instead of DJ’S creates a single issue in a file with 600,000 lines. This issue is fatal: the rest of the file appears as one single line. The best approach is to re-export the CSV file and escape the double-quotes correctly. If it becomes necessary to do so after the fact, CSV Studio has an algorithm to find the extra double-quotes to be escaped.
#4 — Non-standard escape characters
Unix-style files often use backslash (\) inside of quoted strings to escape the string delimiter.
> “This string has a \” in it”
However instead of using backslash, RFC-4180 CSV files double up the string delimiter as an escape mechanism.
> “This string has a “” in it”
When encoding CSV files, one of those escape mechanisms should be consistently applied. However we sometimes encounter files that use both escape methods at the same time. This creates ambiguity whenever an escape sequence is encountered (see table).
#5 — CRLF / Dos line endings
Line endings are not specifically a CSV issue. Windows and Unix line endings differ, and this will of course affect data files the same as for any other file. The official RFC 4180 line separator for CSV is a CRLF sequence (A carriage return character followed by a new line character). When processing a CSV file with CRLF line endings it is not unusual to find an undesirable ^M (or CR) character at the end of every line. This can even cause issues with some CSV parsers. Processing the file with the dos2unix utility is the standard way of resolving this issue.