Efficiently verifying the integrity of large CSV exports

Constant Oduol
Echo Mobile
Published in
3 min readAug 6, 2018

Some Background

Recently we experienced a customer issue that took us a long time to troubleshoot. Our customer heavily relies on exports from our platform in form of CSV (Comma separated values) spreadsheets.
The exports have been working fine as far as we know, but recently the customer reported that they were seeing unexpected duplicates in their exports. To prove their case further, they shared one of the suspect exports with us and just as they said there were duplicates.

We spent a considerable amount of time trying to reproduce the problem on our end but to no avail. Time moved fast and we came to the conclusion that we might never reproduce the problem in time. To increase customer confidence in the exports we started thinking of a way to verify the integrity of their exports before customers could download them.

How our exports currently work

To gain some perspective on how our exports work here is a brief overview. We currently have two main storage solutions. All our data lives in Cloud Datastore which is Google App Engine’s NoSQL database. We also have App Engine Full Text Search (FTS) which stores any data that should be customer searchable.

Customers can run exports from the Datastore or FTS. Exports are generated in parallel and written into many small files on Google Cloud Storage. Once the export is done the small files are merged into one file which the customer can download.

How we decided to verify the integrity of our exports

To verify that the export file we saved to Google Cloud Storage (GCS) was valid we had to read the export file and compare its contents to its source storage solution.

For simplicity I will focus on FTS(storage solution) only. Normally to get data from FTS you execute a query and then fetch the data returned in batches. So in our case to verify the integrity of an export, we had to read the export file from GCS and compare it to the data from FTS.

We can think of the data from GCS and FTS as two spreadsheets. If we can prove that the two spreadsheets have the same contents then the export is valid. To prove this we need to sort the two spreadsheets on the same column. Because of the nature of the reports, we are able to assume both spreadsheets have a column called phone (containing phone numbers). If we sort the spreadsheets on this column we can either compare the two spreadsheets cell by cell or we can compare only the phone columns. The latter is simpler.

Implementing export integrity verification on production

The first attempt using in memory sorting

We read from GCS and wrote the phone numbers to a list in memory, similarly we read from FTS and wrote the phone numbers to another list in memory. Then we tried sorting the two lists and bad things happened: we ran out of memory. Some more perspective, we run on 128MB instances (https://cloud.google.com/appengine/docs/standard/) on App Engine. We realized that sorting two lists both having greater than 15,000 records is not memory efficient on the size of instances we have . This was not going to work.

Using external sorting

Once we realized that we could not sort these large lists in memory, we changed tack. We created a data structure that split the list into segments of pre-specified size, in our case every segment contained a thousand items. So for a list of 14,000 items we had 14 segments. We only allowed 5 segments of the list in memory at a time, the rest of the segments were held in a table in the Cloud Datastore.

merge sort

Sorting the list could now be done in 2 passes, firstly, sorting the individual segments and then using merge sort to merge all the segments into one sorted list. As a side note, larger segments make the list more efficient to sort since they require fewer merges. This proved very effective and using this approach we were able to verify the integrity of our exports on production.

--

--