Introducing Pew Research Center’s Python libraries

(Pew Research Center illustration)

Pewtils

  • is_null/is_not_null: When dealing with multiple data sources, it’s not uncommon to find null values represented in many different ways: numpy.nan, NoneType, empty strings and so on. These functions check for these and other null representations, and can optionally treat empty lists, empty Pandas DataFrames and other custom values as null.
  • FileHandler: We work with data in a variety of formats: pickle files, JSON, CSVs, Excel spreadsheets, etc. The FileHandler class provides a standard interface for reading and writing these and other types of files, and integrates easily with Amazon’s S3 cloud storage.
  • canonical_link: When working with data from the web, we often encounter links to news articles, social media profiles and other useful pages. But since URLs can come in many different forms that all point to the same destination (http://pewresearch.org/example, https://www.pewresearch.org/example, pwrs.ch/example), it can be challenging to analyze these data. The canonical_link function is our best attempt at resolving and standardizing URLs into their true form.
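
To make the null-checking idea concrete, here is a minimal standard-library sketch of the kind of logic described for is_null. It is not the Pewtils implementation: the function name looks_null, its parameters and the list of text markers treated as null are all assumptions chosen for illustration, and it leaves out numpy and Pandas handling.

```python
import math

# Text markers treated as null in this sketch (an assumption, not Pewtils' list).
NULL_STRINGS = {"", "none", "nan", "null", "n/a"}


def looks_null(value, empty_lists_are_null=False, custom_nulls=()):
    """Hypothetical helper: return True if `value` looks like a missing value."""
    if value is None:
        return True
    # float("nan") is never equal to itself, so use math.isnan to detect it.
    if isinstance(value, float) and math.isnan(value):
        return True
    # Strings that are empty, whitespace-only or a known null marker.
    if isinstance(value, str) and value.strip().lower() in NULL_STRINGS:
        return True
    # Optionally count empty lists and tuples as null.
    if empty_lists_are_null and isinstance(value, (list, tuple)) and len(value) == 0:
        return True
    # Finally, compare against any caller-supplied null values.
    return value in tuple(custom_nulls)


def looks_not_null(value, **kwargs):
    """Convenience inverse of looks_null."""
    return not looks_null(value, **kwargs)


records = [None, float("nan"), "  ", "N/A", [], 0, "Pew"]
print([looks_null(r) for r in records])
```

In this sketch the empty list and the integer 0 are not treated as null by default; passing empty_lists_are_null=True flips the result for the empty list, and custom_nulls lets a caller add project-specific placeholders such as "unknown".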

Pew Analytics

  • TextCleaner: There are a variety of ways to preprocess text data and prepare it for analysis. The TextCleaner class helps us do so in a standardized way and includes options for stemming or lemmatization, contraction expansion, cleaning out bits of HTML and filtering parts of speech.
  • TextDataFrame: When exploring a collection of documents, we often work with the data using a Pandas DataFrame that contains both text and metadata about the documents we’re analyzing. The TextDataFrame class provides a variety of useful functions for working with data in this form, allowing us to easily compute document similarities, match documents to those in another Pandas DataFrame, find potential duplicates and identify repeating fragments of text, distinctive words and clusters of documents.
  • compute_scores: Whether we’re conducting a traditional human-coded content analysis or training a machine learning model, we’re often comparing classifications against each other to compute metrics of inter-rater reliability like Cohen’s Kappa and Krippendorff’s Alpha, or measures of model performance like precision and recall. Our compute_scores function takes a DataFrame of classification decisions and produces a clean table with these and other scoring metrics.
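
For readers unfamiliar with the metrics mentioned above, the following sketch computes Cohen's Kappa, precision and recall by hand for two sets of binary codes. It does not use compute_scores or any Pew Analytics code: the function names and the toy data are hypothetical, and the goal is only to show what these numbers measure.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's Kappa for two equal-length sequences of labels."""
    if len(coder_a) != len(coder_b) or len(coder_a) == 0:
        raise ValueError("Both coders must label the same non-empty set of items")
    n = len(coder_a)
    # Observed agreement: share of items where the two coders match.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Expected agreement by chance, from each coder's label frequencies.
    counts_a, counts_b = Counter(coder_a), Counter(coder_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return float("nan")  # Kappa is undefined when chance agreement is total
    return (observed - expected) / (1 - expected)


def precision_recall(truth, predicted, positive=1):
    """Precision and recall of `predicted` against `truth` for one class."""
    pairs = list(zip(truth, predicted))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy data: eight documents coded 1 (relevant) or 0 (not relevant).
human_codes = [1, 1, 0, 0, 1, 0, 1, 0]
model_codes = [1, 0, 0, 0, 1, 0, 1, 1]

print(cohens_kappa(human_codes, model_codes))      # 0.5
print(precision_recall(human_codes, model_codes))  # (0.75, 0.75)
```

Here the two sets of codes agree on six of eight documents (75%), but because each uses both labels half the time, 50% agreement would be expected by chance, which is why Kappa comes out at 0.5 rather than 0.75.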

Both libraries can be installed from GitHub with pip:

pip install git+https://github.com/pewresearch/pewtils#egg=pewtils
pip install git+https://github.com/pewresearch/pewanalytics#egg=pewanalytics
