Stream CSV Files In Rails Because You Can

Using Rails and Ruby Enumerators to start sending data faster

It’s a stream, kind of

I was reading the ActiveAdmin docs, as one does, when I read that ActiveAdmin, by default, streams CSV data when you request it from your browser, rather than sending it all in one chunk. This means that the Rails server can start sending the CSV data to the client while the file is still being generated. This can make the response much faster. (True fact, I was reading it for my mega-hot best seller Take My Money: Accepting Payments on The Web. What do CSVs have to do with money? Read the book. Okay, fine, CSV’s are a great input format for spreadsheets, and spreadsheets are an amazingly powerful analytic engine, so it’s a quick and cheap way to do reporting.)

I was only dimly aware that this was possible in Rails, so I dug in to see how they did it. It turns out that it’s a mix of some fun Ruby tricks and some HTTP header magic that may or may not be superstition. Let’s take a look.

Suppose we were generating some CSV code that looked like this.

class CsvBuilder

attr_accessor :output, :header, :data

def initialize(header, data, output = "")
@output = output
@header = header
@data = data
end

def build
output << CSV.generate_line(header)
data.each do |row|
output << CSV.generate_line(row)
end
output
end
end

This is a simplified class that takes an array of strings as the header and a 2-dimensional array of values. The build method uses CSV.generate_line to create the CSV string for the header and then for each row.

What might seem odd to you is the use of an output variable, which is by default an empty string, and the push (<<) operator to build up the string. This syntax looks a little weird when applied to a string, but it works, and it enables us to call the builder with an output variable that isn’t a string. For example, it could be an Enumerator.

def build_csv_enumerator(header, data)
Enumerator.new do |y|
CsvBuilder.new(header, data, y)
end
end

Now the CsvBuilder will push all those headers and rows onto an Enumerator object. If we call build_csv_enumerator, we can then access the rows one by one:

enum = build_csv_enumerator(header, data)
enum.next
enum.next

There are a lot of cool things we can do with an Enumerator, such as chain it with other enumerator methods, see this blog post for a good introduction to what an Enumerator is and does in Ruby.

For our purposes at the moment, the useful part is that a Rails controller can be given an Enumerator and will use it to deliver the content element by element. Here’s an example.

class ReportsController < ApplicationController

def show
respond_to do |format|
format.html \{\}
format.csv do
headers["X-Accel-Buffering"] = "no"
headers["Cache-Control"] = "no-cache"
headers["Content-Type"] = "text/csv; charset=utf-8"
headers["Content-Disposition"] =
%(attachment; filename="#{csv_filename}")
headers["Last-Modified"] = Time.zone.now.ctime.to_s
self.response_body = build_csv_enumerator(header, data)
end
end
end

private def csv_filename
"report-#{Time.zone.now.to_date.to_s(:default)}.csv"
end
end

The most important line here is the self.response_body = build_csv_enumerator. By setting the response body directly to an enumerator (rather than letting Rails set it implicitly via rendering a template or something), Rails will use the enumerator to send the data element by element, calling next on the enumerator to get the next chunk of data.

We also send a bunch of headers, the list of which I sort of put together from various bits of example code that I saw in various places. The upshot is that the headers keep the server from buffering or caching. Buffering is when the server saves up data so as to send it out in fewer, but larger transactions. Normally that saves time, but since the whole point of streaming this way is to start sending data quickly, in this case we’d rather just start sending stuff. The cache changes are attempts to get various server configurations to avoid caching, which, again will cause the output to be sent in larger amounts.

To generate a CSV file, though, you probably still need to go to the database to get the data, which can also be slow. If you are making a large database call to start your CSV generation, then you can still use Rails and Enumerators to help speed up the start of the database load, and therefore the start of your CSV data stream.

ActiveRecord defines the find_each method, which is a replacement for all that retrieves items from the database in batches of, by default, 1000 rows, so as to prevent long database transactions from locking the database.

def self.csv_collection
Model.find_each.lazy.map do |model|
model.to_row
end
end

If we use this method as the data input input to our build_csv_enumerator method…

build_csv_enumerator(header, Model.csv_collection)

…then we not only get one line at a time behavior for creating the CSV, we also get it for the data access.

What happens is that the Model.csv method returns an Enumerator that has been decorated with the lazy method. By declaring it lazy, we ensure that Ruby will not try to get the next element of the enumerator until we ask for it.

The data flow goes like this:

  • We call build_csv_enumerator with a header row and a lazy enumerator as the data.
  • The CSVBuilder class is initialized using a new enumerator as the output.
  • As the response body is processed by Rails, it continually asks for the next element of the output enumerator.
  • In order to generate the next row, the CSV builder asks for the next data element, which causes the lazy data enumerator returned by find_each to generate its next model.to_row result.
  • When the lazy data enumerator runs out of objects, find_each asks the database for the next batch, and we continue on.

With those pieces in place, we can send the first row of the CSV down to the browser without having to generate the entire file, and without retrieving more than the first batch of database records. While this may not make the entire download faster, it will prevent a long pause before the download begins, and on some servers can prevent timeout errors on long reports.