An Algorithmic Approach to Correct Bias in Urban Transportation Datasets

CDS’ new faculty member Julia Stoyanovich contributes to research that aims to remove specific biases from aggregated datasets

While a significant amount of attention and research has addressed individual privacy concerns in private companies’ datasets, data owners and publishers also want to avoid revealing certain patterns — even in anonymized datasets — that might compromise a competitive advantage or perpetuate discrimination against any group of people. Data published by urban transportation companies is highly valuable for research, policy, and public accountability. Yet concerns over revealing patterns of bias, even when the company is not responsible for these patterns, often hamper efforts to increase transparency.

To assuage these concerns for private urban transportation companies, researchers have developed an algorithm that removes selected biases from datasets while retaining the utility of the data. The researchers include Julia Stoyanovich*, CDS Assistant Professor of Data Science, Computer Science and Engineering, along with Luke Rodriguez, Babak Salimi, and Bill Howe from the University of Washington, and Haoye Ping from Drexel University.

They evaluated their algorithm by applying it to two real datasets, one from a dockless bike share program in Seattle and one of taxi trips in New York, as well as one synthetic rideshare dataset. Their goal was to remove sensitive causal dependencies, patterns that might reveal strategy, violate contracts, or misrepresent the behavior of a certain demographic, without affecting relationships in the data needed for other types of useful analysis.

The researchers augmented the real Origin-Destination datasets (the typical format for transportation data: pairs of locations together with the traffic flow between them) with metadata (gender, company, origin/destination attributes) and applied their algorithm. To ensure that the resulting bias-corrected datasets were still usable, they checked whether the distribution of traffic had changed “too much,” a question they answered by comparing samples drawn from the original and adjusted datasets.
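The comparison above can be illustrated with a toy sketch. The code below is not the researchers' implementation; it simply shows one common way to quantify how far an adjusted Origin-Destination traffic distribution has drifted from the original, using total variation distance over a hypothetical set of trips:

```python
from collections import Counter

def od_distribution(trips):
    """Turn a list of (origin, destination) trips into a normalized flow distribution."""
    counts = Counter(trips)
    total = sum(counts.values())
    return {od: c / total for od, c in counts.items()}

def total_variation(p, q):
    """Total variation distance between two OD traffic distributions (0 = identical)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Hypothetical toy data: trips before and after bias correction
original = [("A", "B")] * 60 + [("A", "C")] * 40
adjusted = [("A", "B")] * 55 + [("A", "C")] * 45

drift = total_variation(od_distribution(original), od_distribution(adjusted))
print(round(drift, 3))  # a small value suggests the traffic pattern is largely preserved
```

A data publisher could compare this drift against the variation expected from simply resampling the original data, which is the flavor of baseline the researchers report beating.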

For the synthetic rideshare dataset, the task was to remove the causal influence of gender on rating; for the Seattle bike share dataset, the task was to remove the effect of company on gender, but not the overall pattern of gender on ridership; for the NYC taxi dataset, they removed the influence of distance on tip amount. Overall, the researchers found that the error introduced by their algorithm was less than what would be expected from sampling.
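To make the first task concrete, here is a deliberately simplified sketch of what "removing the influence of gender on rating" could look like. This is a crude stand-in, not the paper's causal repair algorithm: it only equalizes group means, shifting each group's ratings so that no group averages higher than another, on hypothetical toy records:

```python
from statistics import mean

def remove_group_effect(records, group_key, value_key):
    """Shift each group's values so every group's mean equals the overall mean.

    A crude illustration of removing a dependency: after the shift, knowing the
    group no longer tells you anything about the average value.
    """
    overall = mean(r[value_key] for r in records)
    groups = {}
    for r in records:
        groups.setdefault(r[group_key], []).append(r[value_key])
    offsets = {g: overall - mean(vals) for g, vals in groups.items()}
    return [{**r, value_key: r[value_key] + offsets[r[group_key]]} for r in records]

# Hypothetical rideshare ratings in which gender correlates with rating
rides = [
    {"gender": "F", "rating": 4.0},
    {"gender": "F", "rating": 4.2},
    {"gender": "M", "rating": 4.6},
    {"gender": "M", "rating": 4.8},
]
repaired = remove_group_effect(rides, "gender", "rating")
# In the repaired data, both groups average the overall mean of 4.4
```

The actual algorithm works at the level of causal dependencies rather than simple averages, which is what lets it surgically remove one pathway (such as company on gender) while preserving others (such as gender on ridership).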

In the future, Stoyanovich and her colleagues hope to apply their algorithmic approach to other domains for preprocessing data to be publicly released. Their “broader vision is to develop a new kind of open data system that can spur data science by generating safe and useful synthetic datasets on demand for specific scenarios, using real data as input.”

*This work was completed while Julia Stoyanovich was still at Drexel University. It is part of an NSF-funded project on foundations of responsible data management and follows work on privacy-preserving synthetic data generation. See here and here to read more about that work.

By Paul Oliver