A quick tale about FEFF, an invisible UTF-8 character that wrecked our CSV files
Today, we encountered an error while trying to create some database seeds from a CSV. This CSV was originally generated by me using a Ruby script which piped the output to a file and saved as a CSV.
The CSV was checked in to Git and had been used for awhile until we had to update some parts of it by adding a new column and fixing some values.
While we don’t know the exact reason yet, my theory is that somehow, Excel for Mac (we are all using Macs) added some additional metadata to it even after saving the file as a CSV.
This in turn made anyone using the seed receive the following error:
CSV::MalformedCSVError: Illegal quoting in line 1.
I opened the CSV file and nothing looked suspicious. My first thought was some left/right quotation marks were somehow mixed into the file instead of just the ‘normal’ double quotes: "
. But upon further investigation, there was nothing out of the ordinary. This led me to just wipe out the whole file, and actually type out the first row again.
I saved that file again and ran the migration:
CSV::MalformedCSVError: Illegal quoting in line 1.
What?!
Okay, this was driving me nuts. I opened up a new file, typed the exact single line again, and ran the migration. It worked. So what was in that file?!
Only one way to find out:
cat companies.csv | pbcopy | pbpaste > temp.csv
rm companies.csv
mv temp.csv companies.csv
git diff
So OSX has these two functions that are very useful: pbcopy
and pbpaste
. Basically anything piped to pbcopy
gets into your clipboard and pbpaste
puts what you have on your clipboard to standard output (stdout). But it removes all formatting.
Very useful when you want to just copy some text from somewhere and you want to paste it into a WYSIWYG editor without all the formatting. Like when writing an email from Gmail, for example.
I then removed the original file and saved the new ‘unformatted’ file with the same file name so I could see the difference.
And we finally saw the invisible man:
A quick Google search told us that our friend U+FEFF
was called a ZERO WIDTH NO-BREAK SPACE
. Also, a quick trip to Wikipedia told us about the actual uses for U+FEFF
, more commonly known as Byte order mark
or BOM
.
Our friend FEFF
means different things, but it’s basically a signal for a program on how to read the text. It can be UTF-8
(more common), UTF-16
, or even UTF-32
.
FEFF
itself is for UTF-16
— in UTF-8
it is more commonly known as 0xEF,0xBB, or 0xBF
.
From my understanding, when the CSV file was opened in Excel and saved, Excel created a space for our invisible stowaway, U+FEFF
. And in front of the file to boot!
Excel did some magic, and it was probably saved in UTF-16
instead of UTF-8
. UTF-8
does not understand BOM
and just treats it as a non-character so visually, the file was okay. But Ruby’s CSV
thought that there was something wrong because it assumed the file it was reading was UTF-8
and it couldn’t ignore Mr. U+FEFF
.
So lesson learned: don’t open (and save!) a CSV file in Excel if you want to feed it to Ruby’s CSV
parser.
If you do ever encounter an error like that, be sure to look for hidden characters not shown by your editor. If you still can’t see it and are using OSX, then pbcopy
and pbpaste
will help you out — they strip out any formatting or hidden characters from text in addition to copying and pasting it.