Elixir and Data Ingestion

Lessons learned while parsing a few gigabytes of CSVs

At Urbint, we assemble large amounts of New York City’s data to build a comprehensive API for NYC real-estate. Lately I tried Elixir to see how it stacks up against Go (our current workhorse) for data ingestion. I decided to test it against a common task within our company — parsing multiple CSV files with a common key and then assembling them into a single record. A simplified example:

people.csv
----------
id,name,age
1,Bob,30

favorite_colors.csv
-------------------
person_id,color
1,Red
1,Green

output.json
-----------
{"id": 1, "name": "Bob", "age": 30,
 "favorite_colors": [ "Red", "Green" ]}
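
The join in the simplified example can be sketched in a few lines of plain Elixir. This is only an illustration, not the ingestor we built: the `Join` module and its function names are made up for this post, the CSV splitting is deliberately naive (no quoting or escaping), and the result is left as Elixir maps rather than encoded to JSON.

```elixir
defmodule Join do
  # Splits CSV text into rows of fields, dropping the header line.
  # Naive on purpose: no support for quoted or escaped fields.
  def rows(csv) do
    csv
    |> String.split("\n", trim: true)
    |> Enum.drop(1)
    |> Enum.map(&String.split(&1, ","))
  end

  # Builds one map per person and attaches every color sharing that id.
  def assemble(people_csv, colors_csv) do
    colors_by_id =
      colors_csv
      |> rows()
      |> Enum.group_by(
        fn [person_id, _color] -> person_id end,
        fn [_person_id, color] -> color end
      )

    for [id, name, age] <- rows(people_csv) do
      %{
        id: String.to_integer(id),
        name: name,
        age: String.to_integer(age),
        favorite_colors: Map.get(colors_by_id, id, [])
      }
    end
  end
end

people = "id,name,age\n1,Bob,30\n"
colors = "person_id,color\n1,Red\n1,Green\n"
records = Join.assemble(people, colors)
```

Grouping the child rows by key first keeps the join to a single pass over each file, at the cost of holding the grouped side in memory.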

Lesson 1: If you have lots of records, in-memory probably isn’t your best option

Initially, I was going to hold all the records in RAM. I decided to test parsing a 350MB CSV which had 3,500,000 lines divided into 18 fields. To save on memory I decided to use Elixir’s Record module. The Record module generates macros that map atoms to element indices within tuples. This lets you use a tuple like a map, using atoms to access data, but saves the memory overhead of storing 18 * 3,500,000 keys in maps.
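
As a rough illustration of what the Record module gives you, here is a three-field record instead of our 18-field one. The module names `People` and `RecordDemo` and the field names are invented for this sketch.

```elixir
defmodule People do
  require Record
  # Generates person/0, person/1 and person/2 macros over a plain tuple.
  Record.defrecord(:person, id: nil, name: nil, age: nil)
end

defmodule RecordDemo do
  # The generated macros must be imported (or required) before use.
  import People

  # Builds the tuple using field names instead of positions.
  def build do
    person(id: "1", name: "Bob", age: "30")
  end

  # Reads a field by name; the macro expands to a positional tuple access.
  def name(record) do
    person(record, :name)
  end
end

bob = RecordDemo.build()
# bob is the plain tuple {:person, "1", "Bob", "30"}
```

The field names exist only at compile time, which is why no per-record keys are stored.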

Even with the above optimization, parsing the 350MB source file into 3,500,000 records resulted in ~2GB of RAM usage.

Because dynamic languages don’t know type information at compile time, they have to encode it within the variables themselves to be inspected at runtime. This incurs a bit of a memory cost. The “Erlang Efficiency Guide” documents the memory required by each primitive. Note that sizes are measured in words, which on 64-bit systems means 8 bytes per word.

With our CSV we are generating a 19 element tuple (18 fields +1 overhead for the record type, as per the Record module), and keeping each field as a binary. So, the calculation is as follows:

  2 words for the record tuple       =           2 words
+ 3 words/binary x 18 fields         =          54 words
+ 1 word for the record atom         =           1 word
--------------------------------------------------------
= words per record                   =          57 words
x 3,500,000 records                  = 199,500,000 words
x 8 bytes/word (on 64-bit Erlang)    =   1.596GB of RAM
+ 350MB of source binary data
--------------------------------------------------------
                                     =    1.94GB of RAM
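
The same back-of-the-envelope estimate can be written out as Elixir, which makes it easy to plug in other field or record counts. The variable names are made up for this sketch, and the per-term word counts are the simplified ones used in the calculation above, not exact VM measurements.

```elixir
fields = 18
records = 3_500_000
# 8 on a 64-bit VM; :erlang.system_info(:wordsize) reports the real value.
word_bytes = 8

# 2 words for the tuple + 3 words per binary field + 1 word for the record atom.
words_per_record = 2 + 3 * fields + 1

record_bytes = words_per_record * records * word_bytes
source_bytes = 350_000_000

# Decimal gigabytes, to match the figures in the text.
total_gb = (record_bytes + source_bytes) / 1.0e9
```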

Parsing the file into individual records increased memory use more than 5.5x. Parsing our 2GB data set in parallel would have needed over 10GB of RAM. We weren’t prepared to dedicate that much memory to such a trivial task.

Lesson 2: ETS and DETS also use a lot of space

I decided to see if storing the records in ETS instead of my own structure would yield any results. The answer: Not really. Turning on the :compressed option for the ETS table resulted in ~350MB of memory savings, but nothing remarkable.
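
For reference, a compressed table is requested by passing `:compressed` to `:ets.new/2`. The sketch below uses made-up table names and toy rows; it only shows how to create both kinds of table and read their sizes, and makes no claim about how much the option saves for any particular data.

```elixir
plain = :ets.new(:people_plain, [:set])
packed = :ets.new(:people_packed, [:set, :compressed])

# Toy rows keyed by an integer id.
rows = for id <- 1..10_000, do: {id, "Bob", "30"}
:ets.insert(plain, rows)
:ets.insert(packed, rows)

# Lookups work the same way on a compressed table.
[{1, name, _age}] = :ets.lookup(packed, 1)

# :ets.info/2 with :memory reports the table size in words, not bytes.
plain_words = :ets.info(plain, :memory)
packed_words = :ets.info(packed, :memory)
```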

I decided to see if perhaps DETS had better compression ratios. Unfortunately, this was also not the case. The 350MB file mapped into approximately 1.2 GB when indexed via DETS.
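
A minimal sketch of the kind of DETS usage this refers to, with a hypothetical table name and a throwaway file in the system temp directory. Checking `:file_size` is how you could compare the on-disk footprint against the source file.

```elixir
path = Path.join(System.tmp_dir!(), "people_sketch.dets")
File.rm(path)

# DETS expects the file name as a charlist, not an Elixir string.
{:ok, table} =
  :dets.open_file(:people_sketch, file: String.to_charlist(path), type: :set)

:ok = :dets.insert(table, {"1", "Bob", "30"})
found = :dets.lookup(table, "1")

# Size of the backing file in bytes.
file_bytes = :dets.info(table, :file_size)

:ok = :dets.close(table)
File.rm(path)
```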

For us, ETS and DETS do not seem viable stores, so we ended up dumping the data into a RocksDB database. The result was the 350MB file being indexed as 164MB. This comes at the cost of speed (compared to in-memory) but these tasks were already long running. For reference, indexing the 3,500,000 record file takes 4 minutes on my solid-state 2013 2.4GHz MBP.

Lesson 3: Partial binaries can result in nasty leaks

BEAM optimizes sub-sections of binaries (called either match contexts or sub binaries) allowing them to share underlying data, resulting in less memory need for partial binaries, and less pressure on the garbage collector. Consider:

message = "Hello World"
part = :binary.part(message, 0, 5) # part == "Hello"

Memory:  [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ]
message: [ H, e, l, l, o,  , W, o, r, l, d ]
part:    pointer to index 0 of message, length 5

This optimization also means that keeping a reference to part of the binary blocks the entire source from being freed. In our case, we were extracting a single field from a CSV, but keeping a reference to the field meant that we were preventing the entire line from being freed.

You can get around this by calling :binary.copy on the newly extracted part and keeping a reference to the copy instead. This copies only the data that you need and allows the sub-binary to be freed, which in turn allows the source to be freed.
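
The effect can be observed with `:binary.referenced_byte_size/1`, which reports the size of the binary a value actually points into. The line below is synthetic, and the extracted field is deliberately kept above 64 bytes so the sketch does not depend on how the VM treats very small binaries.

```elixir
# A synthetic "CSV line": an id, a 100-byte field, then a lot of padding.
field_value = String.duplicate("a", 100)
line = IO.iodata_to_binary(["1,", field_value, ",", String.duplicate("x", 10_000)])

# Splitting yields sub-binaries that point into `line`.
[_id, field, _rest] = :binary.split(line, ",", [:global])

# Size of the binary that `field` keeps alive (the whole line).
shared_bytes = :binary.referenced_byte_size(field)

# The copy owns only its own bytes, so the line can be collected
# once nothing else refers to it.
copy = :binary.copy(field)
copied_bytes = :binary.referenced_byte_size(copy)
```

Copying is not free, so a reasonable rule of thumb is to copy only when the referenced size is much larger than the part you intend to keep.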

Lesson 4: The garbage collector can be lazy

BEAM keeps binaries over 64 bytes, called Refc binaries, in a separate heap from normal process variables. This means that these binaries don’t follow the traditional garbage collection patterns.

PJ Papadomitsos did an excellent writeup detailing his findings with BEAM’s garbage collection, but to summarize there are a few key takeaways:

  1. The garbage collector will not kick in until it faces system pressure, i.e., until ~90% of the memory on the system is in use.
  2. When it does face pressure, the garbage collector often cannot kick in fast enough to beat the growing memory needs. This can result in a heap-allocation error from the OS resulting in the entire BEAM VM crashing.
  3. An out-of-band sweep can be forced by calling :erlang.garbage_collect(self()) in the process that the Refc binary originated from.
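
The third point can be sketched as a small helper that forces a collection after each unit of work. The `GcSketch` module and `each_with_gc/2` are hypothetical names for this post, and the "work" here is just a stand-in that builds a large binary.

```elixir
defmodule GcSketch do
  # Runs `fun` on each item, forcing a garbage collection of the calling
  # process after every item so references to large binaries are dropped.
  def each_with_gc(items, fun) do
    Enum.each(items, fn item ->
      fun.(item)
      # Returns true once the calling process has been collected.
      true = :erlang.garbage_collect(self())
    end)
  end
end

# Stand-in for "process one file": build a large binary and report its size.
result =
  GcSketch.each_with_gc([1, 2, 3], fn n ->
    big = String.duplicate("x", 1_000_000)
    send(self(), {:processed, n, byte_size(big)})
  end)
```

Forcing a collection per item trades some throughput for a flatter memory profile, so it is worth measuring before applying it everywhere.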

We experienced system crashes from processing multiple files in parallel when the garbage collector couldn’t keep up. Having to call the garbage collector by hand is a workable but ugly solution.

Lesson 5: The community is fantastic

Elixir is a relatively young language. The community is excited to have new members. They want the language to succeed, and they want it to be used for interesting problems.

Elixir has gained a lot of traction with Ruby/Rails developers, who find familiarity in the syntax and structure offered by Elixir and Phoenix, combined with a functional style and better performance. At the same time, the distributed and concurrent nature of BEAM makes it a tempting candidate for solving more than just the problems faced by web applications.

Every time I ran into issues, #elixir-lang on Freenode was eager to help. Initially I ran into slow file IO, which led to Eric Entin creating nifsy, which provides NIF bindings to OS-level file operations. CSV parsing performance was slower than expected, leading José Valim to write NimbleCSV, a CSV parsing library that keeps pace with Go’s standard library encoding/csv.

It is clear that the community is looking for more than just web applications to be implemented in Elixir.

Elixir for Data Ingestion?

It may seem like I listed a lot of problems with Elixir that make it a poor candidate for data ingestion. That isn’t true. These are the nuances to be aware of in BEAM when solving these types of problems. The obstacles are easy to avoid, as long as you know they are present.

Speed of development in Elixir is absurdly quick. Developer productivity and happiness are fantastic. We were able to prototype a concurrent ingestor with rate limiting, transparent database batching, and more, all in just over a week and a half. We had limited familiarity with the language, and had to go through a few WTFs (see above) to learn about the subtleties of BEAM. I expect it to be even faster next time.
