Over the past five years, Pew Research Center’s Data Labs team has worked hard to steadily advance the Center’s data science capabilities. From text analysis to computer vision, we’ve applied a variety of computational methods to study important social issues in new ways and expand the scope of what’s possible for the Center. In doing so, we’ve written a lot of code.
In the spirit of our commitment to transparency and our desire to provide methodological resources to the public, we’re excited to release a collection of Python tools that we’ve found ourselves returning to again and again.
If you’ve ever been frustrated by wrangling a bunch of files or cleaning up text documents, we’re hoping these tools will help make your life a little easier. We’ve split this release into two packages on the Center’s GitHub page: one for utilities that can be applied to any programming project, and another with tools that are specifically catered to data processing and analysis.
Shortly after we got started on our first data science project, we noticed that we needed to perform a number of tasks on a regular basis, like loading and merging data in different formats or checking for null values and standardizing URLs. We often found ourselves borrowing code from each other for these routine tasks, and we started making a habit of cleaning up and generalizing these useful tidbits after each project so we could easily leverage them in the future. We’ve now gathered the most basic of these functions into a package called “Pewtils” (short for “Pew Utilities”), a collection of programming utilities that have a broad range of possible applications. Pewtils ensures that we have a consistent, canonical way of doing common tasks, and we use it on a daily basis. Here are a few highlights:
is_not_null: When dealing with multiple data sources, it’s not uncommon to find null values represented in many different ways: NoneType, NaN, empty strings and so on. This function (and its counterpart, is_null) checks for these and other null formats, with options to also treat empty lists, empty Pandas DataFrames and other custom values as null.
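To give a flavor of the kind of check this handles, here’s a minimal, standalone sketch of a multi-format null test. It is an illustration of the idea, not the pewtils implementation or its exact signature:

```python
from math import isnan

def is_not_null_sketch(value, empty_lists_are_null=False):
    """Return True unless `value` looks like a null in one of several common formats."""
    if value is None:
        return False
    if isinstance(value, float) and isnan(value):  # NaN from numeric data sources
        return False
    if isinstance(value, str) and value.strip().lower() in ("", "none", "nan", "null"):
        return False
    if empty_lists_are_null and isinstance(value, list) and len(value) == 0:
        return False
    return True

is_not_null_sketch(" NaN ")                          # False
is_not_null_sketch([], empty_lists_are_null=True)    # False
is_not_null_sketch("hello")                          # True
```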
FileHandler: We work with data in a variety of formats: pickle files, JSON, CSVs, Excel spreadsheets, etc. The FileHandler class provides a standard interface for reading and writing these and other types of files, and integrates easily with Amazon’s S3 cloud storage.
canonical_link: When working with data from the web, we often encounter links to news articles, social media profiles and other useful pages. But since URLs can come in many different forms that all point to the same destination (pwrs.ch/example), it can be challenging to analyze these data. The canonical_link function is our best attempt at resolving and standardizing URLs into their true form.
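A few of the standardization steps involved can be illustrated with Python’s standard library alone. This sketch handles only surface-level normalization (case, “www.”, tracking parameters, trailing slashes); the actual canonical_link function goes much further, including resolving shortened links to their destinations:

```python
from urllib.parse import urlparse, parse_qsl, urlencode

# Common tracking parameters that don't change a URL's destination
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign",
                   "utm_term", "utm_content", "fbclid"}

def normalize_url(url):
    """Reduce superficial URL variants to a single standardized form."""
    parts = urlparse(url if "://" in url else "http://" + url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    # Drop tracking parameters and put the rest in a stable order
    query = sorted((k, v) for k, v in parse_qsl(parts.query)
                   if k.lower() not in TRACKING_PARAMS)
    cleaned = host + parts.path.rstrip("/")
    if query:
        cleaned += "?" + urlencode(query)
    return cleaned

normalize_url("https://WWW.Example.com/story/?utm_source=x")  # "example.com/story"
normalize_url("example.com/story")                            # "example.com/story"
```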
On the Data Labs team, our work doesn’t end after we’ve looped over and loaded in a batch of files and filtered out null values. Many of our research projects also involve routine tasks specifically related to processing and analyzing data, like cleaning up text documents, removing duplicate records and looking for hidden clusters and groups. Our new Pew Analytics package contains a collection of tools designed to make these tasks easier, including:
TextCleaner: There are a variety of ways to preprocess text data and prepare it for analysis. The TextCleaner class helps us do so in a standardized way and includes options for stemming or lemmatization, contraction expansion, cleaning out bits of HTML and filtering parts of speech.
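A bare-bones cleaning pipeline in the same spirit might look like the following. The helper name, contraction table and stopword list here are made up for illustration; the real class offers far richer options, as described above:

```python
import re

CONTRACTIONS = {"don't": "do not", "it's": "it is", "we're": "we are"}
STOPWORDS = {"the", "a", "an", "and", "of", "to"}

def clean_text(text):
    """Strip HTML, lowercase, expand a few contractions,
    then remove punctuation and stopwords."""
    text = re.sub(r"<[^>]+>", " ", text)   # remove HTML tags
    text = text.lower()
    for contraction, expansion in CONTRACTIONS.items():
        text = text.replace(contraction, expansion)
    text = re.sub(r"[^a-z\s]", " ", text)  # strip punctuation and digits
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)

clean_text("<p>It's the BEST tool!</p>")  # "it is best tool"
```

Running every document through one function like this ensures the whole team preprocesses text the same way, which is the main point of standardizing the pipeline.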
TextDataFrame: When exploring a collection of documents, we often work with the data using a Pandas DataFrame that contains both text and metadata about the documents we’re analyzing. The TextDataFrame class provides a variety of useful functions for working with data in this form, allowing us to easily compute document similarities, match documents to those in another Pandas DataFrame, find potential duplicates and identify repeating fragments of text, distinctive words and clusters of documents.
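The similarity and duplicate-detection features rest on a simple idea: represent each document as a term vector and compare vectors. Here is a pure-Python sketch using raw term counts and cosine similarity (the real class operates on DataFrames and uses weighted vectors, so treat this as a conceptual illustration only):

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Cosine similarity between two documents' term-count vectors."""
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def find_potential_duplicates(docs, threshold=0.9):
    """Flag index pairs whose similarity meets or exceeds a threshold."""
    return [(i, j)
            for i in range(len(docs))
            for j in range(i + 1, len(docs))
            if cosine_similarity(docs[i], docs[j]) >= threshold]

docs = ["the cat sat on the mat", "the cat sat on the mat", "dogs bark loudly"]
find_potential_duplicates(docs)  # [(0, 1)]
```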
compute_scores: Whether we’re conducting a traditional human-coded content analysis or training a machine learning model, we’re often comparing classifications against each other to compute metrics of inter-rater reliability like Cohen’s Kappa and Krippendorff’s Alpha, or measures of model performance like precision and recall. Our compute_scores function takes a DataFrame of classification decisions and produces a clean table with these and other scoring metrics.
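To make one of these metrics concrete, Cohen’s Kappa corrects the raw agreement rate between two coders for the agreement you’d expect by chance alone. This standalone version is an illustration of the formula, not the pewanalytics implementation:

```python
def cohens_kappa(coder_a, coder_b):
    """Cohen's Kappa for two coders' labels on the same items:
    (observed agreement - chance agreement) / (1 - chance agreement).
    Undefined (division by zero) when chance agreement is exactly 1."""
    assert len(coder_a) == len(coder_b)
    n = len(coder_a)
    labels = set(coder_a) | set(coder_b)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: for each label, the product of the two coders' rates
    expected = sum((coder_a.count(label) / n) * (coder_b.count(label) / n)
                   for label in labels)
    return (observed - expected) / (1 - expected)

cohens_kappa([1, 1, 0, 0], [1, 1, 0, 0])  # 1.0 (perfect agreement)
cohens_kappa([1, 1, 0, 0], [1, 0, 1, 0])  # 0.0 (no better than chance)
```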
Over the coming months, we’ll be taking a deeper dive into these packages, highlighting how to use them and explaining how they might help you in your programming endeavors. And Pewtils and Pew Analytics are just two of many internal libraries that we hope to share with the Python community and continue developing in the months ahead. In the meantime, we encourage you to stay tuned, explore the pewanalytics documentation and follow us on GitHub!
To get started with the Pew Python libraries, just install them using pip:
pip install git+https://github.com/pewresearch/pewtils#egg=pewtils
pip install git+https://github.com/pewresearch/pewanalytics#egg=pewanalytics
Patrick van Kessel is a senior data scientist at Pew Research Center.