Working With CSPro Data Using Python (Pandas)

Nahom Tamerat
5 min read · Oct 15, 2019

First off, what is CSPro?

CSPro, short for the Census and Survey Processing System, is a public domain data processing software package developed mainly by the U.S. Census Bureau. Its main purposes are entering, editing, tabulating, and disseminating data from censuses and surveys.

CSPro uses data dictionaries to provide a common description of each data file used. CSPro encodes data in UTF-8 format and has no database management capabilities. The CSPro framework includes a suite of tools including CSEntry and CSWeb. It is widely used (over 160 countries) in censuses and surveys.

This software is quite old (about two decades) and has its share of quirks and technical debt. It stores questionnaire responses as a sequence of alphanumeric characters lumped together in a single blob of text:

101010300011021024101121951010121111 93292100000

Deciphering the value of each response (with the aim of representing the questionnaire as a table) will require the dictionary, which acts as a sort of “map”.

Basically, the dictionary is a list of questions and possible responses. Answers are coded as value-label pairs and what is stored as a response in the text blob is the corresponding value of the selected response, not the response itself.

For example, for a question that asks for marital status, the possible responses and their values could be:

Value=1;Never Married
Value=2;Married Monogamous
Value=3;Married Polygamous
Value=4;Widowed
Value=5;Divorced
Value=6;Separated
Value=9;DK

What is stored in the data string is the value (1, 2, etc.). So, when presenting the results of an analysis, one would typically need to replace those values with their respective labels.
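As a rough sketch of that replacement step, the value set above can be turned into a lookup table and applied to a column with Pandas. The `parse_value_set` helper, the `MARITAL_STATUS` column name, and the sample responses are made up for illustration; this is not how any particular library does it.

```python
import pandas as pd

# Value set lines exactly as they appear in the marital status example.
value_set_lines = [
    "Value=1;Never Married",
    "Value=2;Married Monogamous",
    "Value=3;Married Polygamous",
    "Value=4;Widowed",
    "Value=5;Divorced",
    "Value=6;Separated",
    "Value=9;DK",
]

def parse_value_set(lines):
    """Turn 'Value=<code>;<label>' lines into a {code: label} mapping."""
    labels = {}
    for line in lines:
        _, _, rest = line.partition("=")      # drop the 'Value' key
        code, _, label = rest.partition(";")  # split the code from its label
        labels[code.strip()] = label.strip()
    return labels

marital_labels = parse_value_set(value_set_lines)

# Hypothetical coded responses, kept as strings because they are cut from text.
responses = pd.Series(["2", "1", "9", "4"], name="MARITAL_STATUS")
labelled = responses.map(marital_labels)
print(labelled.tolist())
# ['Married Monogamous', 'Never Married', 'DK', 'Widowed']
```

Keeping the codes as strings avoids guessing at types; any code missing from the mapping would come back as NaN from `map`, which makes gaps in the value set easy to spot.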

Motivation

Even though CSPro is an end-to-end solution, providing various tools for tabulating and analyzing data, one might want another means of inspecting the data.

For example, during a census or a survey, one might want a dashboard through which some indicators and overall progress can be monitored.

In such a scenario, the CSPro tools would not be useful… hence the need for other tools such as CSPro2sql, which transforms CSPro data into relational form so that it can be queried (using SQL).

Python + Jupyter Notebooks

Python is arguably the best tool if you need to do some sort of data wrangling. It has versatile data structures and, with libraries such as Pandas, is fast enough for this kind of work. Pandas, a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive, adds great power and capability to the task of analyzing and manipulating the data.

Parsing the Dictionary

The first order of business was to read in and parse the dictionary (either from file or from the CSWeb database, which is MySQL). The parsed dictionary is needed to cut up the cases into their constituent columns and for various other data-related tasks, such as replacing the values of responses with their labels.

Here is a small chunk of an example dictionary that comes with CSPro. You can download it from here

To ensure that the CSPro-specific “syntax” is followed and to capture the dictionary's hierarchical (tree-like) nature, a Finite State Machine (FSM) was needed. A quick search led to Transitions, which is one of the best FSM implementations for Python. It’s a joy to use and the documentation has plenty of examples (but lacks a comprehensive list of APIs).

CSPro dictionaries are basically a linear sequence of sections; conceptually, however, they form a hierarchical tree. There is therefore a convention that dictates the order in which the sections appear, and it is that knowledge that is encapsulated in the FSM.
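To make the idea concrete, here is a minimal table-driven sketch in plain Python. It does not use Transitions (so that it runs without third-party packages), and it is not the actual parser. The allowed ordering of sections in `ALLOWED` is an assumption for illustration, limited to a single level with no sub-items, and `build_tree` is a hypothetical helper that only tracks section names rather than their contents.

```python
# Which section may follow which. This ordering is an illustrative assumption,
# restricted to a single level and no sub-items.
ALLOWED = {
    "start": {"Dictionary"},
    "Dictionary": {"Level"},
    "Level": {"IdItems"},
    "IdItems": {"Item"},
    "Item": {"Item", "ValueSet", "Record"},
    "ValueSet": {"ValueSet", "Item", "Record"},
    "Record": {"Item"},
}

def build_tree(sections):
    """Fold a linear list of section names into a nested dict, checking order."""
    state = "start"
    tree = {"id_items": [], "records": []}
    items = None  # the list that the next Item belongs to
    for name in sections:
        if name not in ALLOWED[state]:
            raise ValueError(f"[{name}] cannot follow {state}")
        if name == "IdItems":
            items = tree["id_items"]
        elif name == "Record":
            record = {"items": []}
            tree["records"].append(record)
            items = record["items"]
        elif name == "Item":
            items.append({"value_sets": 0})
        elif name == "ValueSet":
            items[-1]["value_sets"] += 1  # attach to the most recent Item
        state = name
    return tree

sections = ["Dictionary", "Level", "IdItems", "Item",
            "Record", "Item", "ValueSet", "Item",
            "Record", "Item"]
tree = build_tree(sections)
print(len(tree["records"]))  # 2
```

The current state decides where each new section is attached, which is how a flat file turns into a tree. A section that appears out of order, such as an Item directly after Dictionary, raises a `ValueError` instead of silently producing a malformed tree.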

This is the state diagram for the machine that parses the CSPro dictionary.

Certain choices were made in writing the state machine such as assuming that dictionaries will only have a single Level and will also not have sub-items. This was for the sake of expediency, simplicity and only fulfilling current needs.

Once you put your dictionary through the parser, you will end up with a Python dictionary object that is deeply nested.

A conceptual representation of the parsed CSPro dictionary as represented in Python using the dictionary object

And here is a screenshot of the output (a piece of it) of the parsed dictionary. Please note that the same dictionary (from the CSPro example) is being used throughout.

Generating Columns from Cases

Once we have a parsed dictionary, the next order of business would be to try to use that dictionary to cut up the response/case string (the text blob) into the various individual responses.
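The cutting itself boils down to fixed-width slicing. The sketch below assumes each item is described by a 1-based start position and a length; the item names, positions, and the sample line are invented for illustration and do not come from the example dictionary.

```python
# Hypothetical item layout: name -> (start, length), with 1-based start positions.
layout = {
    "PROVINCE": (1, 2),
    "DISTRICT": (3, 2),
    "HOUSEHOLD": (5, 3),
    "MARITAL_STATUS": (8, 1),
}

def cut_record(line, layout):
    """Slice one fixed-width line into {item name: raw text}."""
    return {
        name: line[start - 1 : start - 1 + length]  # convert to 0-based slicing
        for name, (start, length) in layout.items()
    }

row = cut_record("01020152", layout)
print(row)
# {'PROVINCE': '01', 'DISTRICT': '02', 'HOUSEHOLD': '015', 'MARITAL_STATUS': '2'}
```

The values are left as raw strings here; converting them to numbers or labels is a separate step that depends on what the dictionary says about each item.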

The following image shows how CSPro stores questionnaire responses.

This represents only the beginning of the case. It could go on for much longer and could also contain various records, each separated by a newline character (\n).

Writing the case parser was quite challenging. This was because a case could be made up of multiple records. A record is a group of related data items. For example, a census might have various records such as population record, housing record, death record, etc. In addition, a single case can have multiple entries for each record. In a single household, there would be multiple population records and a single housing record.
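A simplified way to picture the problem: before any columns can be cut, the lines of a case have to be sorted into their records. The sketch below assumes the first character of each line is a record type code; the codes, record names, and sample case are hypothetical.

```python
from collections import defaultdict

# Hypothetical record type codes, assumed to sit in the first character of a line.
RECORD_NAMES = {"1": "HOUSING", "2": "POPULATION"}

def group_records(case_text):
    """Group the lines of one case by record type, keeping their order."""
    grouped = defaultdict(list)
    for line in case_text.split("\n"):
        if not line:
            continue  # skip blank lines, such as the one after a trailing newline
        grouped[RECORD_NAMES[line[0]]].append(line)
    return dict(grouped)

# One household: a single housing record and three population records.
case_text = "1001A3\n2001011M34\n2001022F31\n2001033F07\n"
grouped = group_records(case_text)
print({name: len(lines) for name, lines in grouped.items()})
# {'HOUSING': 1, 'POPULATION': 3}
```

Each group can then be cut into columns with its own item layout, which is why a single case naturally ends up as more than one table.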

The aim was to return a dictionary that could trivially be converted into a Pandas DataFrame. Therefore, the case parser returns a Python dictionary that can be passed directly to the Pandas from_dict method, which then constructs a DataFrame from it.

case_as_dataframe = pd.DataFrame.from_dict(parsed_case)
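For a feel of what that looks like, here is a small sketch with a made-up dictionary of columns. The shape of `parsed_population` (column name mapped to a list of values) is an assumption for illustration, not necessarily the exact structure the case parser returns.

```python
import pandas as pd

# Hypothetical parsed output for one record: column name -> list of values,
# one entry per occurrence of the record.
parsed_population = {
    "LINE_NUMBER": ["01", "02", "03"],
    "SEX": ["1", "2", "2"],
    "AGE": ["34", "31", "07"],
}

population = pd.DataFrame.from_dict(parsed_population)

# Values cut from text arrive as strings, so numeric columns need converting.
population["AGE"] = population["AGE"].astype(int)
print(population.shape)
# (3, 3)
```

With the default orientation, from_dict treats each dictionary key as a column, so every list must have the same length.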

Once both the dictionary and case parsers were implemented, all that remained was to package it up and add it to the Python Package Index (PyPI). That was simply done by following this guide from the official Python documentation.

In a matter of minutes, you can have your CSPro data looking like this!

The resulting package is pyCSPro, which you can install by typing:

pip install pycspro

If you happen to find any bugs or if you have any comments and suggestions, please feel free to get in touch.

There is also a Jupyter Notebook on Binder (great project!) that you can play with in a live environment (in your browser) to see how easy it is to use this library. Please take it for a spin!

Once you have your data as a Pandas Data Frame, the world is your oyster!

You can reach me on Twitter @nahomt


Nahom Tamerat

Full stack developer (Laravel, React Native, Node, …), book worm, electronics hobbyist, wood worker, maker and tinkerer.