Measurement Errors in Data Extraction and Cleaning: How to Handle?

Arieda Muço
5 min read · Jul 30, 2023


Cover image: https://designmuseumfoundation.org/

Accessing and cleaning data can be a joyous or painful experience for numerous reasons. Here, I focus on the “pain” of handling data extracted from the web. As I previously discussed in this Medium article, the data was scraped from newspaper articles published by a leading Brazilian outlet.

In this post, I discuss the inaccuracies in these data sources, errors introduced during the extraction process, and more.

The Challenges

The data for this project, spanning over a decade, was obtained by scraping articles from the newspaper’s website. I was primarily concerned about three issues: 1) data gaps due to deactivated article links, 2) missing articles that the web crawler failed to capture, and 3) omission of chunks of text from the captured articles.

To confirm that the full text was indeed retrieved for each article, I repeatedly cross-checked random links, comparing the stored article text against the text at the corresponding link. While it's impossible to be 100% sure, I iterated this step several times; at some point we had to trust that the full text had been extracted throughout the decade of interest.

Even assuming all links were available and the extraction process was flawless, the ultimate goal was to perform entity extraction, seeking occurrences of specific municipalities in conjunction with certain keywords or tokens. This opened the door to potential errors:

  1. Entity extraction inaccuracies: The choice of tool is crucial. I decided on spaCy due to its proven effectiveness (a minimal sketch of this step follows the list).
  2. Misidentification of entities: An entity identified as a municipality might be something entirely different, such as a street name or a government program.
  3. Keyword selection: The keywords of interest were related to corruption, but articles might use synonymous terms such as ‘fraud’. Deciding whether to use only root words related to corruption or to build a classifier introduces another level of uncertainty.
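
To make the first two points concrete, here is a minimal sketch in Python of what an entity-plus-keyword check could look like. It is an illustration under my own assumptions, not the project's actual pipeline: the model name, the municipality list, and the keyword list below are placeholders.

import spacy

# Assumes the Portuguese model has been downloaded beforehand:
#   python -m spacy download pt_core_news_sm
nlp = spacy.load("pt_core_news_sm")

# Illustrative placeholders; the real project uses the full list of Brazilian
# municipalities and a curated set of corruption-related tokens.
MUNICIPALITIES = {"Santa Luzia", "Campinas"}
KEYWORDS = {"corrupção", "fraude", "desvio"}

def flag_article(text):
    doc = nlp(text)
    # Entities labeled as locations may be municipalities, but also streets
    # or program names, hence the need for human spot-checks.
    places = {ent.text for ent in doc.ents if ent.label_ in {"LOC", "GPE"}}
    mentioned = places & MUNICIPALITIES
    has_keyword = any(tok.lemma_.lower() in KEYWORDS for tok in doc)
    return mentioned, has_keyword

print(flag_article("A prefeitura de Santa Luzia é investigada por corrupção."))

Even a simple check like this surfaces the second problem quickly: a match against the municipality list does not guarantee the article is actually about that municipality.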

Possible Solutions

Given these challenges, how do we address them? First, I employed a human-in-the-loop approach to cross-check the extracted data against the original text. Despite being labor-intensive, this process proved invaluable for better understanding the data.

Normalization: We can normalize the variables according to mentions in the pre-treatment period, or use the Term-Frequency-Inverse Document Frequency (TF-IDF) approach to account for differences in how often terms and entities appear across articles. These solutions help reduce the overrepresentation of certain entities in the text.
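
As a rough illustration of the TF-IDF idea, here is a short Python sketch using scikit-learn (my choice of library for the example, not necessarily what was used in the project). Terms that appear in nearly every article receive low weights, which counteracts the overrepresentation problem.

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Toy articles; in practice these would be the scraped newspaper texts.
articles = [
    "corrupção na prefeitura de Santa Luzia",
    "obras na prefeitura de Campinas",
    "fraude em licitação na prefeitura de Santa Luzia",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(articles)  # rows = articles, columns = tokens

# "prefeitura" appears in every toy article and gets a low weight;
# "fraude" appears only once and gets a relatively high weight.
weights = pd.DataFrame(tfidf.toarray(), columns=vectorizer.get_feature_names_out())
print(weights.round(2))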

Domain knowledge is crucial here. Whether to normalize by the pre-treatment period is task-specific; in my case it makes sense because I am interested in the causal effect of a randomly assigned treatment on the mentions.

Indeed, in the context of statistical modeling, measurement error in the dependent (outcome) variable is generally considered less problematic than measurement error in an independent variable. (See Jerry Hausman’s work on the matter.)

If the error in the outcome variable is well-behaved, with a mean of zero and independently and identically distributed (iid), it simply gets absorbed into the regression’s error term: the coefficient estimates remain unbiased, and we only have to deal with wider confidence intervals. This is not the case if the error is in your main regressor, or if the above assumptions are violated. See here for the course material, and here if you’d like a challenge involving measurement error in the running variable in a Regression Discontinuity setting.

Let’s see this with some Stata code, first under the assumption of a well-behaved measurement error. ‘Well-behaved’ here means that the error is iid with a mean of zero.

In the first scenario below, the measurement error does not differ between treated and untreated municipalities; in the second scenario, the measurement error depends on the treatment.

* Simulate monthly data with municipality, month, and year identifiers
clear
set seed 12345
set obs 1200
egen id = seq(), from(1) to(100) block(12)
egen month = seq(), from(1) to(12) block(1)
egen year = seq(), from(1) to(10) block(12)

* Randomly assigned treatment
bysort id year : gen treatment = rbinomial(1,0.5)
replace treatment = 0 if missing(treatment)

* True data-generating process: the treatment effect is 0.2
gen outcome_monthly = 1.5 + 0.2*treatment + rnormal(0,1)

* Case 1: well-behaved (classical) measurement error, iid with mean zero
gen measurement_error_w = rnormal(0,2)
gen noisy_outcome_monthly_w = outcome_monthly + measurement_error_w

* reghdfe is user-written: install it once with -ssc install reghdfe-
reghdfe noisy_outcome_monthly_w treatment , a(id month year)

* Case 2: measurement error that depends on the treatment
summarize outcome_monthly , d
gen measurement_error = rnormal(0,3) if treatment==1
replace measurement_error = rnormal(0.5,1) if treatment==0
gen noisy_outcome_monthly = outcome_monthly + measurement_error
reghdfe noisy_outcome_monthly treatment , a(id month year)

In this example, the treatment is simulated to have an effect of approximately 0.2 on the outcome. The results, with and without a well-behaved measurement error, are plotted below:

Coefficient estimates are plotted in blue, 95% confidence intervals in gray, and the true value in orange.

As the plot shows, only the confidence interval of the first case includes the true value; the point estimate is close to the true value, although, because of the added noise, it is not statistically significant.

In the second case, where the error is not well-behaved, the estimated coefficient is negative and statistically significant, so we might wrongly conclude that our treatment negatively affects the outcome.

Lessons Learned and Advice

Measurement errors are inherent in empirical research. It’s crucial to understand how they might bias our results and to question whether the error behaves classically or not. Often, it doesn’t.

I find a few strategies to be beneficial:

Adaptation: Understand the intricacies of your data and be ready to adapt your methodologies accordingly.

Collaboration: Discuss your problems with colleagues or AI, or reflect on them through writing. Invite others to join your projects.

Peer Review: Share your code and data with others. They might provide alternative solutions to your problem, and constructive feedback is invaluable.

It’s important to remember that while humans are prone to errors, algorithms operate strictly on the principles we feed them. Therefore, open discussions and peer reviews are crucial components of scientific progress.

Note(s) to self

Humans are prone to making mistakes; algorithms are not. They are deterministic, executing commands precisely and consistently. However, this doesn’t mean they are flawless.

The output quality of any algorithm relies on two key factors: the quality of the data it receives and the logic behind it.

Take the time to familiarize yourself with the data and document all steps, because measurement errors in the data can also arise from badly designed or badly implemented algorithms.
