CSV Testing Tools for Open Data

Aaron Steffy Couch
3 min readMay 11, 2016

--

TLDR; I’ve found both validation and sampling tests extremely useful when parsing CSV files and have provided an example.

CSV files are still a standard for sharing Open Data. They are easy to read and parse and not locked to a proprietary toolchain. When parsing less open formats like XLS or PDF files into CSVs I have found myself wanting to test the outputs for obvious reasons. This is especially true when parsing large numbers of files, working on an evolving script or tool, or working in large groups. I want to make sure large files have the data I expect and changes or enhancements to a script don’t break previous outputs.

I’ve created a repo, csv-testing-tools, that collects two strategies for testing CSV files. The repo has the following structure:

├── example-election-data.xlsx       (sample data to be parsed)
├── example-election-result.csv (sample results)
├── example-parser.py (sample parsing script)
├── package.json
├── requirements.txt
├── .travis.yml
└── tests

├── csv-test-config.yml (additional csv-test validator)
├── csv-test-validators.yml (validation rules)
└── features
├── elections.feature (behave test feature)
└── steps
└── steps.py (behave step definition)

I’ve included an example workflow that I’ve adopted from the Open Elections project. The script itself is hot garbage. The important part is running the example-parser.py ouputs a parsed CSV file:

python example-parser.py example-election-data.xlsx example-election-result.csv

The example-election-result.csv file has the following format:

county,ward,office,district,total votes,party,candidate,votes

I’ve adopted two strategies for testing the output of this file: validating the results and testing samples of the results.

Validation Tests

For validation I’ve used csv-test. (EDIT: For those who want to stick to a Python stack check out: https://github.com/di/vladiate) The csv-test-config.yml file in my example repo contains the validation rules. Unfortunately since election results often don’t have standards I’ve had to use regex for a number of the fields. I’ve also created additional validator, isOneof, as an example which validates that the party results are from a set of defined options “DEM, REP, IND, CON, NP, LIB, NA, WIG, NA, GRE, WGR.”

I’ve also created a PR on the csv-test library that allows checking of entire folder of CSV files instead of having to name them individually.

The result of the test ensures that the results in the total votes and votes columns are numbers and that the results in the district column are either a number or empty, and so on with the rest of the columns.

Testing Sample Data

The validation ensures that the results in each row are in an expected format but not that they are accurate or correctly reflect the source. There are many ways to test a sample of results to ensure the expected outcome is met. I chose BDD tests since I wanted an easy and visual way to map the expectation to the result. Every language has their own BDD implementation. I could have gone with Vows if I wanted to stick to node to be consistent with my validation test. However most of the scripting I do is in python as well as the parser so I went with Behave.

In the case of election results I wanted to verify that the results in the XLS format:

are properly output in the CSV. To do this I’ve created an elections.feature with the following scenario:

I then provide examples for each file I am parsing. In the example case:

Once this is setup it is easy to add additional files and sample results.

Summary

These are two strategies I’ve found useful. I’d be interested to hear other strategies or tools folks have used when testing CSV files.

--

--

Aaron Steffy Couch

creator of #DKAN. it is crazy that we spend public funds on private code.