Elixir and Data Ingestion

Lessons learned while parsing a few gigabytes of CSVs

Ryan Schmukler
Aug 17, 2016 · 5 min read

At Urbint, we assemble large amounts of data to build comprehensive APIs for machine learning and urban data. Lately I tried Elixir to see how it stacks up against Go (our current workhorse) for data ingestion. I decided to test it against a common task within our company — parsing multiple CSV files with a common key and then assembling them into a single record. A simplified example:

{"id": 1, "name": "Bob", "age": 30,
"favorite_colors": [ "Red", "Green" ]}
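A minimal sketch of that assembly step, joining rows from two extracts on the shared key with Enum.group_by (the field names and data here are illustrative, not our real schema):

```elixir
# Hypothetical rows from two CSV extracts sharing an "id" key.
people = [%{id: 1, name: "Bob", age: 30}]
colors = [%{id: 1, color: "Red"}, %{id: 1, color: "Green"}]

# Index the secondary rows by key, keeping only the value we want.
colors_by_id = Enum.group_by(colors, & &1.id, & &1.color)

# Merge into single records.
records =
  Enum.map(people, fn person ->
    Map.put(person, :favorite_colors, Map.get(colors_by_id, person.id, []))
  end)
```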

Lesson 1: If you have lots of records, in-memory probably isn’t your best option

Even with our parsing optimizations in place, parsing the 350MB source file into 3,500,000 records resulted in ~2GB of RAM usage.

Because dynamic languages don’t know type information at compile time, they have to encode it within the variables themselves to be inspected at runtime. This incurs a memory cost. The “Erlang Efficiency Guide” documents the memory required by each primitive. Note that sizes are measured in words; on 64-bit systems a word is 8 bytes.

With our CSV we are generating a 19-element tuple (18 fields plus 1 element of overhead for the record type, as per the Record module), keeping each field as a binary. The calculation is as follows:

  2 words   tuple overhead for the record
+ 54 words  (3 words/binary × 18 fields)
+  1 word   record-name atom
= 57 words  per record

  57 words × 3,500,000 records   = 199,500,000 words
× 8 bytes/word (64-bit Erlang)   = ~1.596GB of RAM
+ 350MB of source binary data
                                 = ~1.94GB of RAM

Parsing a file into individual records resulted in a more than 5.5x increase in memory usage. Parsing our 2GB data set in parallel would have required over 10GB of RAM. We weren’t prepared to dedicate that much memory to such a trivial task.

Lesson 2: ETS and DETS also use a lot of space

I decided to see if perhaps DETS had better compression ratios. Unfortunately, this was also not the case: the 350MB file mapped to approximately 1.2GB when indexed via DETS.

For us, ETS and DETS do not seem viable stores, so we ended up dumping the data into a RocksDB database. The result was the 350MB file being indexed as 164MB. This comes at the cost of speed (compared to in-memory), but these tasks were already long running. For reference, indexing the 3,500,000-record file takes 4 minutes on my 2013 2.4GHz MacBook Pro with an SSD.
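A minimal sketch of that indexing step, assuming the erlang-rocksdb bindings (the :rocksdb module, e.g. {:rocksdb, "~> 1.8"} in mix.exs); the path, key scheme, and row are illustrative:

```elixir
# Assumes the erlang-rocksdb NIF bindings are available as :rocksdb.
{:ok, db} = :rocksdb.open(~c"/tmp/ingest.db", create_if_missing: true)

# Index each parsed CSV row under its key; serialize the term for storage.
row = %{id: 1, name: "Bob", age: 30}
:ok = :rocksdb.put(db, "1", :erlang.term_to_binary(row), [])

# Read it back.
{:ok, bin} = :rocksdb.get(db, "1", [])
record = :erlang.binary_to_term(bin)

:ok = :rocksdb.close(db)
```

Keys sort lexicographically in RocksDB, so zero-padding numeric keys keeps range scans in record order.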

Lesson 3: Partial binaries can result in nasty leaks

When you take part of a binary, the BEAM avoids copying: it returns a sub-binary that simply points into the original.

message = "Hello World"
part = :binary.part(message, 0, 5) # part == "Hello"

Memory:  [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ]
message: [ H, e, l, l, o,  , W, o, r, l, d ]
part:    pointer to index 0 of message, length 5

This optimization also means that keeping a reference to part of the binary blocks the entire source from being freed. In our case, we were extracting a single field from a CSV, but keeping a reference to the field meant that we were preventing the entire line from being freed.

You can get around this by calling :binary.copy/1 on the newly extracted part and keeping a reference to the copy instead. This copies only the data you need, allowing the sub-binary, and in turn the source binary, to be freed.
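For example (the sizes here are illustrative):

```elixir
# A 1MB binary with a small field at the end.
big = String.duplicate("x", 1_000_000) <> ",field"

# The sub-binary pins the entire 1MB source in memory.
sub = :binary.part(big, byte_size(big) - 5, 5)

# Copying yields a fresh 5-byte binary; once `big` and `sub` go out of
# scope, the 1MB source can be garbage collected.
field = :binary.copy(sub)
```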

Lesson 4: The garbage collector can be lazy

PJ Papadomitsos did an excellent writeup detailing his findings with BEAM’s garbage collection, but to summarize there are a few key takeaways:

  1. The garbage collector will not kick in until it faces system pressure, i.e., until roughly 90% of the memory on the system is in use.
  2. When it does face pressure, the garbage collector often cannot reclaim memory fast enough to keep up with growth. This can result in a failed heap allocation from the OS, crashing the entire BEAM VM.
  3. An out-of-band sweep can be forced by calling :erlang.garbage_collect(self()) in the process that the Refc binary originated from.

We experienced system crashes from processing multiple files in parallel when the garbage collector couldn’t keep up. Having to call the garbage collector by hand is a workable but ugly solution.
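A sketch of that workaround, sweeping after each file so refc binaries referenced by the ingesting process are released promptly (the module, function, and callback here are hypothetical names, not our actual pipeline):

```elixir
defmodule Ingest do
  # Process each file with the supplied function, then force an
  # out-of-band GC sweep in this process so binary references on
  # our heap are dropped before the next file is loaded.
  def process_files(paths, process_fun) do
    Enum.each(paths, fn path ->
      process_fun.(path)
      :erlang.garbage_collect(self())
    end)
  end
end
```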

Lesson 5: The community is fantastic

Elixir has gained a lot of traction with a lot of Ruby/Rails developers who find familiarity in the syntax and structure offered by Elixir and Phoenix combined with functional style and better performance. At the same time, the distributed and concurrent nature of BEAM make it a tempting candidate for solving more than just the problems faced by web applications.

Every time I ran into issues, #elixir-lang on Freenode was eager to help. Initially I ran into slow file IO, which led to Eric Entin creating nifsy, which provides NIF bindings to OS-level file operations. CSV parsing was slower than expected, leading José Valim to write NimbleCSV — a CSV parsing library that keeps pace with Go’s standard-library encoding/csv.
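For reference, a minimal NimbleCSV sketch, assuming {:nimble_csv, "~> 1.2"} as a dependency (the parser name and data are illustrative):

```elixir
# NimbleCSV compiles a dedicated parser module for the given dialect.
NimbleCSV.define(CommaParser, separator: ",", escape: "\"")

rows =
  CommaParser.parse_string("""
  id,name,age
  1,Bob,30
  """)

# parse_string skips the header row by default (skip_headers: true),
# so rows == [["1", "Bob", "30"]]
```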

It is clear that the community is looking for more than just web applications to be implemented in Elixir.

Elixir for Data Ingestion?

Development in Elixir is absurdly quick, and developer productivity and happiness are fantastic. We were able to prototype a concurrent ingester with rate limiting, transparent database batching, and more, all in just over a week and a half. We had limited familiarity with the language and had to work through a few WTFs (see above) to learn the subtleties of the BEAM. I expect it to be even faster next time.

Thanks to Russell Matney.
