Using Machine Learning for Precision Nudging: Jens Ludwig at Princeton

Behavioral nudges are good at maximizing a specific outcome; can machine learning nudge people to make better decisions?

As governments use machine learning algorithms to personalize how they treat citizens, how might the behavioral sciences shape how these systems are used?

Speaking at Princeton’s Behavioral Policy Center today is Jens Ludwig, a professor at the University of Chicago, the director of the University of Chicago Crime Lab, and the co-director of the University of Chicago Urban Education Lab.

Jens Ludwig at the Kahneman-Treisman Center for Behavioral Public Policy

For the last ten years, Jens has been working in the Chicago Urban Labs, which tries to make research useful for policymakers. Jens was trained as an economist, and he was trained to think about policy in terms of changes in incentives. With behavioral science, we now have more complex models of behavior that are useful to policymakers in areas ranging from energy and water use to public space, and we’ve seen researchers make substantial contributions to many high-stakes policies.

To illustrate the value of behavioral science to policymakers, Jens talks about a recent study by Fishbane, Ouss, Shah, and others, who looked at failures to appear in court. Typically, the system thinks about failures to appear as disobedience, so courts give people a penalty and issue a warrant for arrest. Very quickly, minor interactions with the criminal justice system can spin out of control. Behavioral scientists would see this problem as related to attention: maybe instead of just fining people, you could text them with information about how to make a plan. They found that, compared to sending no message, sending people reminders can reduce rates of failure to appear by non-trivial amounts.

Normally, an experimenter would look for the most effective intervention and apply it to everyone. But over time, Jens and his colleagues asked: “Why should one size fit all?” A text message is great for someone who’s always online. It’s less good for someone who struggles to stay on top of electronic communication.

Are Personalized Policies the Second Behavioral Revolution?

That’s why Jens and his colleagues are talking about what they call a “second behavioral revolution”: using big data and machine learning to offer personalized interventions to people. To start, Jens offers to introduce machine learning and then to imagine the new research and questions we need to ask as these personalized policies become more common.

Why Algorithms from Non-Experts Are Useful for Policy

To illustrate what machine learning does, Jens tells us about sentiment analysis: computer programs that try to infer the affect of what someone is saying. He shows us examples of Amazon reviews and talks about how companies often try to infer whether a review is positive or negative. On Amazon, you get a numerical “ground truth” in the form of a five-star rating. Statistical machine learning systems take this kind of training dataset and use it to train an algorithm to detect reviews that are similar to positive reviews it has seen in the past.
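To make the idea concrete, here is a minimal sketch of training on star ratings. It is not anything Jens showed: the reviews are invented, the 4-star cutoff for "positive" is an assumption, and the word-count log-odds scoring is just one simple stand-in for a real sentiment model.

```python
from collections import Counter
import math

# Hypothetical toy reviews with star ratings (invented for illustration).
REVIEWS = [
    ("great product works perfectly", 5),
    ("love it great value", 5),
    ("terrible quality broke quickly", 1),
    ("awful waste of money", 1),
    ("works great love the quality", 4),
    ("broke after a week terrible", 2),
]

def train(reviews, positive_min_stars=4):
    """Count how often each word appears in positive and in negative reviews."""
    pos, neg = Counter(), Counter()
    for text, stars in reviews:
        # The star rating is the "ground truth" label; the cutoff is an assumption.
        target = pos if stars >= positive_min_stars else neg
        target.update(text.split())
    return pos, neg

def score(text, pos, neg):
    """Sum smoothed per-word log-odds; a score above zero leans positive."""
    vocab_size = len(set(pos) | set(neg))
    pos_total, neg_total = sum(pos.values()), sum(neg.values())
    total = 0.0
    for word in text.split():
        # Add-one smoothing so unseen words do not produce log(0).
        p = (pos[word] + 1) / (pos_total + vocab_size)
        n = (neg[word] + 1) / (neg_total + vocab_size)
        total += math.log(p / n)
    return total

pos_counts, neg_counts = train(REVIEWS)
print(score("great quality", pos_counts, neg_counts) > 0)
print(score("terrible waste", pos_counts, neg_counts) > 0)
```

Nothing in the code encodes what "great" or "terrible" means; the labels alone drive the scores.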

As Jens tells us this story, he shows how badly AI systems perform when they are built on human expertise, compared with systems that are simply trained on evidence. He often mentions this example in policy conversations as a way to respond to people who critique the systems because the engineers might know nothing about the setting in which the algorithm is being employed. Since machine learning systems don’t model the assumptions of their designers, it’s fine if designers are unaware of the political and social uses of their creations, Jens says.

How Personalized Policies Change How We Think About Nudges

Jens points out that many behavioral nudges to date have been used in settings where everyone agrees that increasing something is a good thing: we want more people to floss or to wear their seatbelts. But in cases like the decisions of judges, we don’t actually want to maximize a particular kind of behavior: we don’t want judges to release everyone, nor do we want them to increase the number of people they put in jail. Some people say that judges are too harsh. Others say that judges are too lenient. It’s clear that judges’ decisions aren’t optimal, but it’s possible that they could be non-optimal in both directions.

With a machine learning aid to human decisions, it might be possible to provide case-by-case nudges to judges. This will only help if the algorithm’s predictions are more accurate than those of humans. In a new study with Jon Kleinberg, Jure Leskovec, and Sendhil Mullainathan, they show that building the algorithm is much easier than evaluating the system.

To illustrate this challenge, Jens describes an algorithm they built for judges in a large U.S. city. In the project, they created a training dataset of a half million observations and then evaluated the algorithm on a hold-out set. In this city, state law says that judges are only allowed to look at flight risk: the chance that the defendant will skip a key court appearance (failure to appear). To predict flight risk, they used a historical dataset of people who were released by judges, and who either showed up or didn’t. They observed information about the person’s current offense, prior record, and age. In an initial model, they avoided using information about race, ethnicity, gender, or neighborhood of residence.

Jens asks us to imagine a group of teenage defendants, and to suppose that the judge sees something like a tattoo that ends up being highly predictive of flight risk. Imagine that the judge uses that information to make decisions and detains all teenagers with tattoos. In that case, says Jens, the training dataset would be biased, because outcomes are only observed for people who were released. That could mislead the algorithm and potentially lead to more people failing to appear and committing further crimes. It might be better for judges and algorithms to make decisions based on someone’s tattoo than to ignore that detail.

To address this issue, the researchers looked at the history of decisions by specific judges and classified how lenient they were. They then used this leniency information, together with econometric methods, to overcome the selective labels problem (I haven’t figured out how to type fast enough to capture econometrics details with accuracy; more information is in the paper here).
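As a rough illustration of the first step only, one simple way to classify judges by leniency is to compute each judge's release rate from their past decisions. This is my own sketch, not the paper's method: the judge names and decisions are invented, and the econometric step that uses leniency to handle selective labels is not shown.

```python
from collections import defaultdict

# Hypothetical decision history: (judge id, whether the defendant was released).
DECISIONS = [
    ("judge_a", True), ("judge_a", True), ("judge_a", False), ("judge_a", True),
    ("judge_b", False), ("judge_b", True), ("judge_b", False), ("judge_b", False),
]

def release_rates(decisions):
    """Return each judge's share of cases that ended in release."""
    released = defaultdict(int)
    total = defaultdict(int)
    for judge, was_released in decisions:
        total[judge] += 1
        released[judge] += int(was_released)
    return {judge: released[judge] / total[judge] for judge in total}

rates = release_rates(DECISIONS)
# Order judges from most to least lenient under this assumed measure.
most_lenient_first = sorted(rates, key=rates.get, reverse=True)
print(rates, most_lenient_first)
```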

Jens shows us the predicted risks from their model. First, judges are detaining around 10% of the lowest-risk people, according to their algorithm. Second, he argues, judges are also releasing a lot of high-risk people. It’s hard to look at these results and conclude that judges aren’t making a mistake.

Next, Jens asks us to imagine a decision aid based on this algorithm. What does success look like when there’s not a right answer in how you should trade detention off for crime?

Computer scientists will tend to train an algorithm that uses weights for errors in both directions: they assign a cost to each kind of mistake the algorithm can make.
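Here is a small sketch of what weighting errors in both directions can look like. Everything in it is an assumption for illustration: the risk scores and outcomes are invented, the toy pretends we can observe outcomes for everyone, and the cost values are arbitrary. The point is only that the "best" detention threshold moves when the relative costs change.

```python
# Hypothetical cases: (predicted risk, whether the person actually failed to appear).
CASES = [
    (0.1, False), (0.2, False), (0.3, True), (0.4, False),
    (0.6, True), (0.7, False), (0.8, True), (0.9, True),
]

def total_cost(threshold, cases, cost_false_release, cost_false_detain):
    """Cost of detaining everyone whose predicted risk is at or above threshold."""
    cost = 0.0
    for risk, failed in cases:
        detained = risk >= threshold
        if detained and not failed:
            cost += cost_false_detain    # detained someone who would have appeared
        elif not detained and failed:
            cost += cost_false_release   # released someone who failed to appear
    return cost

def best_threshold(cases, cost_false_release, cost_false_detain):
    """Pick the candidate threshold with the lowest total cost (first one on ties)."""
    candidates = sorted({risk for risk, _ in cases}) + [1.01]  # 1.01 means detain no one
    return min(
        candidates,
        key=lambda t: total_cost(t, cases, cost_false_release, cost_false_detain),
    )

print(best_threshold(CASES, cost_false_release=5, cost_false_detain=1))
print(best_threshold(CASES, cost_false_release=1, cost_false_detain=5))
```

Whoever sets the two cost numbers is making the value judgment about how to trade detention off against crime.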

In this paper, Jens and his colleagues try something else, making the following argument: rather than try to change the detention rate, an algorithm could put the same number of people in jail while reducing the crime rate (e.g. people failing to appear). If you did that, the machine learning algorithm could reduce crime by 25% without changing the rate at which a judge detains people. Alternatively, if you wanted to reduce detentions while keeping the crime rate the same, you could reduce the detention rate by 42% without any change in court hearings that are skipped.
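The logic of holding the detention rate fixed can be sketched in a few lines. This is an illustration under loud assumptions, not the paper's evaluation: the defendants and risk scores are invented, and the predicted risk is treated as if it were the true probability of failing to appear, which is exactly what the selective labels problem makes hard to verify.

```python
# Hypothetical defendants: (predicted risk of failure to appear, judge detained?).
DEFENDANTS = [
    (0.05, True), (0.10, False), (0.20, False), (0.30, True),
    (0.50, False), (0.60, False), (0.80, True), (0.90, False),
]

def judge_released_risks(defendants):
    """Risks of the people the judge actually released."""
    return [risk for risk, detained in defendants if not detained]

def reranked_released_risks(defendants):
    """Detain the same number of people as the judge, but pick the highest risks."""
    detained_count = sum(1 for _, detained in defendants if detained)
    ranked = sorted(defendants, key=lambda d: d[0], reverse=True)
    return [risk for risk, _ in ranked[detained_count:]]

judge_released = judge_released_risks(DEFENDANTS)
algo_released = reranked_released_risks(DEFENDANTS)

# Expected failures to appear among released people, assuming the risks are accurate.
print(round(sum(judge_released), 2), round(sum(algo_released), 2))
```

Both policies release the same number of people; only who gets released changes.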

This seems exciting for policymakers: judges are making mistakes, and if we can improve predictions, we could improve the justice system. At the same time, many people question the fairness of these systems. In an analysis, the researchers found that an algorithm blinded to race and ethnicity can be fairer than judges.

Algorithms blinded to race, ethnicity, and gender follow principles from U.S. law that restrict how humans may use this information when making decisions. But algorithms aren’t black boxes in the same way as a human mind, says Jens: we have the power to design and adjust the algorithm. We can use that power to create fairer systems.

Should Algorithms Be Blind to Race, Gender, and other Protected Characteristics?

In a world where we care about fairness, should we give algorithms access to race, gender, and other protected characteristics? Jens tells us about an upcoming project to predict college admissions decisions. Imagine that we treat every person who applies to college as part of the applicant pool, and that we want to admit the top 50% of people in terms of their predicted performance in college. Imagine that the outcome we care about is the risk that the student will show up and not do well (having a G.P.A. below a B average). We then build a risk model that predicts GPA.

Next, Jens asks us to imagine two decision-makers: the efficient planner only cares about admitting the strongest academic class, and the equitable planner cares about fairness. We allow both planners to make decisions using three algorithms: (a) a race-blind algorithm, (b) an algorithm whose inputs are orthogonalized to race, and (c) a race-aware algorithm.

Given these options, the efficient planner will always choose the race-aware algorithm, which maximizes GPAs, says Jens. Now imagine an equitable planner, someone who cares about fairness. Imagine that the equitable planner has all three algorithms. In this paper, he and his co-authors argue that the equitable planner will also use a race-aware algorithm, one that will admit more black students and yield higher GPAs than the others.

Why might the race-aware algorithm do better? Jens argues that the race-blind algorithm does poorly within groups: a B might mean something different for a black person than for someone else, says Jens. If you predict college success and blind the algorithm to race, the system will mis-predict GPAs for people within groups. Any prediction function that leads a lower-ranked minority applicant to be admitted over a higher-ranked minority applicant can’t be optimal. To get that right, says Jens, you need an algorithm that uses race as a variable.
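A toy example can show how a blind scoring rule ends up mis-ranking people within a group. Everything here is invented for illustration: the applicants, the two generic groups "A" and "B", the assumption that test scores drive outcomes in one group and grades in the other, and the hard-coded weights, which stand in for fitted models rather than coming from the paper.

```python
# Hypothetical applicants: (name, group, test_score, grade), all on a 0-1 scale.
APPLICANTS = [
    ("a1", "A", 0.9, 0.4), ("a2", "A", 0.3, 0.7),
    ("b1", "B", 0.9, 0.3), ("b2", "B", 0.2, 0.8),
]

# Assumed weights on (test_score, grade). The aware rule has one pair per group.
BLIND_WEIGHTS = (0.5, 0.5)
AWARE_WEIGHTS = {"A": (1.0, 0.0), "B": (0.0, 1.0)}

def true_outcome(group, test_score, grade):
    """Toy ground truth: tests matter in group A, grades matter in group B."""
    return test_score if group == "A" else grade

def blind_score(group, test_score, grade):
    """Same weights for everyone; the group label is ignored."""
    w_test, w_grade = BLIND_WEIGHTS
    return w_test * test_score + w_grade * grade

def aware_score(group, test_score, grade):
    """Group-specific weights."""
    w_test, w_grade = AWARE_WEIGHTS[group]
    return w_test * test_score + w_grade * grade

def rank_within(group, score_fn):
    """Names of one group's applicants, highest score first."""
    members = [a for a in APPLICANTS if a[1] == group]
    members.sort(key=lambda a: score_fn(a[1], a[2], a[3]), reverse=True)
    return [name for name, *_ in members]

print(rank_within("B", true_outcome))
print(rank_within("B", blind_score))
print(rank_within("B", aware_score))
```

In this toy, the blind rule puts b1 ahead of b2 even though b2 has the better true outcome, which is the within-group mis-ranking the argument turns on.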

The right thing to do from a fairness perspective, says Jens, is out of bounds under current legal structures. He encourages the creation of algorithms that make decisions based on race, gender, and other characteristics.

Frontiers For Using Machine Learning in Policy and Decisions

Many policy decisions have a similar structure, according to Jens, where it’s important to make a high-quality decision and not just maximize a particular outcome, as in teacher hiring and medical decisions. Recently, Jens has been working on research to train machine learning models on citizen complaints about police officers. The system will then help sergeants decide which officers might be likely to create adverse police outcomes (like shooting unarmed black people) and who may need support before a tragedy happens.

Overall, we are building systems that society is deciding can outperform humans, and we’re designing ways to evaluate these systems. As we do so, Jens tells us, we’re going to need new research to figure out how to get people to use machine learning tools most productively, and that’s something that we need behavioral scientists to lead.