Streaming CSV report in Phoenix

Exporting a query result as a CSV file is a very common task in web apps. Here I’m going to show how to implement it with latency, performance, memory usage and separation of concerns in mind.

The naive approach is:

  1. Execute the query and load all rows into RAM: fetching all of them takes time, so the user waits and receives nothing; it also consumes memory.
  2. Dump the rows into a temporary file on disk: the user is still waiting, and now we also use disk space.
  3. Read the file back into memory: the user is still waiting, and we use memory again.
  4. Respond with the in-memory contents: only at this moment does the user start to receive any data.
  5. Delete the temporary file on the server.

What we will do instead:

  1. Execute the query.
  2. Fetch a batch of rows (100 or 500, for example) into RAM.
  3. Convert the batch to CSV in memory.
  4. Send it to the user.
  5. Repeat from step 2 until all rows have been processed.

Pros:

  • Lower latency: we don’t need to wait for the whole CSV to be prepared; we can start sending data as soon as the first batch of rows is ready.
  • Low memory consumption: we only use the memory needed to process one batch of rows at a time.
  • No disk usage at all.

One more thing I’d like to have is separation of concerns, for code readability. I find it clearer to separate the business logic (acquiring the report data itself) from the application logic (CSV formatting and the HTTP response). So our business logic function will return a stream of report rows, and the controller will take responsibility for streaming this data as HTTP chunks in CSV format, without knowing anything about what the data is or how we acquired it.

So let’s start with the controller:
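
Something like the following could serve as a sketch of the controller action (the MyAppWeb.ReportController module, the Reports.stream_report/1 context function and the batch size of 500 rows per HTTP chunk are illustrative assumptions, not code from a real project):

```elixir
defmodule MyAppWeb.ReportController do
  use MyAppWeb, :controller

  alias MyApp.Reports
  alias NimbleCSV.RFC4180, as: CSV

  def export(conn, _params) do
    conn =
      conn
      |> put_resp_content_type("text/csv")
      |> put_resp_header("content-disposition", ~s(attachment; filename="report.csv"))
      |> send_chunked(:ok)

    # The business logic runs the query and hands us a lazy stream of rows;
    # here we only turn batches of rows into CSV and send them as HTTP chunks.
    {:ok, conn} =
      Reports.stream_report(fn stream ->
        stream
        |> Stream.chunk_every(500)
        |> Enum.reduce(conn, fn rows, conn ->
          {:ok, conn} = chunk(conn, CSV.dump_to_iodata(rows))
          conn
        end)
      end)

    conn
  end
end
```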

Here we first send some headers:

  • Content-Type: text/csv — well, it’s CSV
  • Content-Disposition: attachment; filename="report.csv" — this tells the browser to download the content as a file named report.csv instead of showing it on the page.
  • Transfer-Encoding: chunked — enables HTTP streaming. The :ok status corresponds to HTTP 200.

Next we call the business logic part and pass it a callback which receives the stream, takes rows in batches, converts them to CSV, and sends them to the user.

Here I use the NimbleCSV library to encode the list of lists of strings to CSV. Although CSV seems to be a very simple format, there are actually some subtleties concerning escaping, so let the library take care of them.
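
For instance, with the bundled NimbleCSV.RFC4180 definition, a field containing a comma is quoted automatically:

```elixir
alias NimbleCSV.RFC4180, as: CSV

[["name", "age"], ["Doe, John", 42]]
|> CSV.dump_to_iodata()
|> IO.iodata_to_binary()
#=> "name,age\r\n\"Doe, John\",42\r\n"
```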

Now to the business logic part:
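
A minimal sketch of that function could look like this (the MyApp.Reports module, the orders table and the selected columns are assumptions for illustration):

```elixir
defmodule MyApp.Reports do
  import Ecto.Query

  alias MyApp.Repo

  def stream_report(callback) do
    query =
      from o in "orders",
        select: [o.id, o.customer_name, o.total]

    # The stream returned by Repo.stream can only be enumerated inside
    # a transaction, so the callback runs within Repo.transaction/1.
    Repo.transaction(fn ->
      query
      |> Repo.stream()
      |> callback.()
    end)
  end
end
```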

It’s pretty simple: we just execute an Ecto query to get the data. Note the use of Ecto.Repo.stream/1 instead of Ecto.Repo.all/1: it creates a stream that we pass to the callback we received from the controller. By default it fetches 500 rows per batch, but this is configurable via the :max_rows option. Also note that the stream returned by Ecto.Repo.stream/1 can only be enumerated inside a transaction, which is why it’s wrapped in the function passed to Ecto.Repo.transaction/1.

That’s it. Often, though, the report may need a complex query that can’t be expressed with Ecto’s query DSL, and we have to write raw SQL for it. So here is an example with a raw SQL query:
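
A sketch of such a function, using Ecto.Adapters.SQL.stream/4 (the SQL itself and the stream_raw_report/1 name are just placeholders):

```elixir
# A raw-SQL variant living in the same MyApp.Reports context module.
def stream_raw_report(callback) do
  sql = """
  SELECT o.id, c.name, SUM(i.amount) AS total
  FROM orders o
  JOIN customers c ON c.id = o.customer_id
  JOIN order_items i ON i.order_id = o.id
  GROUP BY o.id, c.name
  """

  MyApp.Repo.transaction(fn ->
    # Ecto.Adapters.SQL.stream/4 emits result structs in chunks of up to
    # :max_rows rows; flat_map over their rows so the controller still
    # receives a plain stream of rows, exactly as in the Ecto version.
    Ecto.Adapters.SQL.stream(MyApp.Repo, sql, [], max_rows: 500)
    |> Stream.flat_map(& &1.rows)
    |> callback.()
  end)
end
```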

Hope these examples will help someone. Happy coding!