Predicting the Path of Congressional Bills

ejafek
7 min read · Mar 11, 2020


Today we’re going to use machine learning to predict how far our friend, Bill, will get in the legislative process. As you can see in the diagram above, Bill has many hurdles to cross after being introduced.

In a previous post, I describe how I scraped, cleaned, and wrangled the data used for this analysis. I now have over 100,000 bills from 20 years (or 10 Congresses) to work with. For each bill I have a variety of information including its sponsors, subjects, and committee assignments. I assigned each bill one of five final statuses, based on each bill’s last major action.

The tree diagram below shows the percentage of bills that ended up in each of the five final status bins.

It’s a sad truth for Bill that 82.5% of bills receive no further consideration after being referred to committee. And only 3.4% of bills actually make it to the finish line and become a full-fledged law. It’s also interesting to note the 2.9% who receive action in committee but make it no further — this is the smallest category. This indicates that it’s quite rare for lawmakers to actually take action on a bill and then for it to die in committee.

However, the big divide I want to focus on is “Referred to committee/subcommittee” vs. everything else. This divide seems to separate serious bills (our friend Bill is obviously an upstanding, serious bill) from bills that are not worth our attention. Our main model will therefore try to predict this split. We will also look at trying to predict laws, and trying to predict all five categories (with varying degrees of success).

Before we jump into modeling, we need to take a closer look at our features. The features with the highest correlation to bill success are:

  1. Number of cosponsors
  2. Sponsor party rank
  3. Sponsor committee match
  4. Bipartisanship
  5. Length of summary
  6. Sponsor in the majority
  7. How many committees bill was referred to (bills referred to multiple committees have better prospects)
  8. Committee
  9. Subject

These all make sense. More cosponsors means more people rooting for and working towards bill passage. Higher rank in the party means a more senior sponsor with more clout and more relationships with other lawmakers. Sponsor committee match means that the sponsor is in the committee the bill gets referred to, which means they can help it get past the biggest hurdle of actually getting consideration in the committee.

Interestingly, the bills with the highest chance of being passed appear to be those dealing with environmental and energy issues. The committee with the highest correlation with success is the Energy and Natural Resources Committee, and the subject area with the highest correlation with success is Public Lands and Natural Resources.

The bills with the lowest chance of passage appear to be those that deal with finance and taxation. Below, you can see the number of laws (in orange) as a proportion of bills proposed in each subject area.

Preprocessing

One of the biggest challenges of the dataset was the high imbalance between negative and positive classes. Especially when trying to predict which bills would become law, I had a negative class of over 100,000 and a positive class of less than 4,000.

I dealt with this in two ways: bootstrapping, and focusing on measures of success other than accuracy. It’s important not to accept accuracy as a valid measure of success in this case, because a model could return 97% accuracy just by predicting that every bill would fail.

I bootstrapped my training data using sklearn’s `resample` utility, which allows sampling with replacement from certain classes. Below is the function I used to draw samples from all the classes I wanted to be better represented in the training data:
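The original function was embedded as an image, so here is a minimal sketch of the approach with `sklearn.utils.resample`; the function name, the `status` label column, and the parameter names are my assumptions, not the post’s exact code:

```python
import pandas as pd
from sklearn.utils import resample

def balance_classes(df, label_col, target_size, classes_to_boost, random_state=42):
    """Upsample under-represented classes to target_size rows each.

    df             : training DataFrame (never apply this to test data)
    label_col      : column holding the class label (e.g. "status")
    target_size    : row count each boosted class should end up with
    classes_to_boost : labels to sample with replacement
    """
    # Keep the rows of all other classes exactly as they are.
    pieces = [df[~df[label_col].isin(classes_to_boost)]]
    for cls in classes_to_boost:
        subset = df[df[label_col] == cls]
        pieces.append(
            resample(subset, replace=True, n_samples=target_size,
                     random_state=random_state)
        )
    # Concatenate and shuffle so the boosted rows are interleaved.
    return pd.concat(pieces).sample(frac=1, random_state=random_state)
```

Applied to the training split only, this brings the “law” rows up to parity with the failures before fitting.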

However, this is not enough. Even though my training data is now basically balanced, my testing data is still as unbalanced as ever — which is how it should be. Do not touch your testing data.

Therefore, instead of focusing on accuracy as a measure of success, we will be focused on the recall rate. The recall rate is the percentage of total relevant results correctly classified by your algorithm. In other words, instead of focusing on the number of predictions the model got right, we’re going to focus on how many of our positive class (the laws, or the bills that got action) were predicted correctly.
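To make the accuracy-vs-recall point concrete, here is a toy illustration (the class counts are made up to mirror the roughly 97/3 split, not taken from the actual test set):

```python
from sklearn.metrics import accuracy_score, recall_score

# A dummy "model" that predicts failure (0) for every bill,
# on a toy test set with 97 true failures and 3 true laws (1).
y_true = [0] * 97 + [1] * 3
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))  # 0.97 -- looks impressive
print(recall_score(y_true, y_pred))    # 0.0  -- catches zero laws
```

High accuracy, zero recall: exactly the failure mode we want our metric to expose.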

Modeling

I ran many different models. This includes both using different machine learning models (mainly Logistic Regression, Random Forest, and Adaboost), and attempting to predict different sets of outcomes. With each model, I bootstrapped to balance classes in the training data and scaled using the MinMaxScaler to make sure all the variables were scaled from 0–1. Let’s look at the various model results:
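The scaling-plus-classifier setup can be sketched as a pipeline; the synthetic feature matrix below is a stand-in for the real bill features (cosponsor counts, sponsor rank, summary length, and so on), and the classifier slot is interchangeable:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Toy stand-in for the bill feature matrix and binary labels.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))
y_train = rng.integers(0, 2, size=200)

# MinMaxScaler maps each feature to the 0-1 range; swap in
# LogisticRegression or AdaBoostClassifier to compare models.
model = make_pipeline(MinMaxScaler(), RandomForestClassifier(random_state=0))
model.fit(X_train, y_train)
```

Fitting the scaler inside the pipeline, on training data only, also honors the “do not touch your testing data” rule: the test set is transformed with the training set’s min and max.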

1. Law Model

Predicting whether or not a bill would become a law. The Random Forest Classifier produced the best results.

Accuracy: 96%

Recall: 20%

Verdict: The model is basically classifying every bill as failed — but getting high accuracy because the vast majority do fail. Not useful.

2. Three Category Model

Predicting which of three categories a bill falls into: 1) Referred to committee, 2) Progress Made, 3) Law. Logistic Regression provided the best results.

Accuracy: 71%

Recall: Law: 67%; Progress Made: 51%; Referred to committee: 75%

Verdict: Interestingly, once the model is given more choices, it gets better at classifying the laws. It now over-predicts Law relative to Progress Made, as you can see in the graph below. Somewhat useful.

3. Action Taken Model

Will the bill get beyond being referred to committee? A Random Forest Classifier produced the best results.

Accuracy: 79%

Recall: 77%

Verdict: The model over-predicts bills getting action, but does correctly catch 77% of the bills that actually did. Useful.

It’s interesting to not only see the results of the models, but to think about why some models work better than others. It appears that there are certain reliable indicators about a bill and its sponsor that mean it has a good chance of making it beyond committee. For instance, a “non-serious” bill may only have one sponsor and that sponsor may have no connection to the committee to which the bill is assigned.

Lawmakers propose bills for many reasons, and the main reason is not always to actually try to legislate. They may simply want to make a statement, they may want to be able to tell their constituents they’re doing something (even if they really aren’t), or they may simply want to start a discussion. This model can help us look at only the serious bills, the bills actually worth our attention. So it looks like our friend Bill may actually have a decent shot at making it out of committee.

4. Two Stage Model: Second Stage Four Category Prediction

Just as a point of interest, I also tried to predict, based only on bills that made it beyond simply “referred to committee,” the four other outcomes. The Random Forest Classifier produced the best results.

Accuracy: 62%

Recall: Law: 57%, Passed one chamber: 49%, Referred to full chamber: 67%, Action in committee: 66%

The model’s performance is not great, and it strongly over-predicts the number of bills that will make it out of committee to the full chamber. This may indicate that there is little difference between bills that get hearings in committee and bills that are referred to the full chamber. Overall, not terribly useful.

Conclusions

I’ve included all these models because I think it’s important to show failure and success. While the “action taken” model does provide accurate enough results to be useful, the three other models are rather terrible.

Lack of results is a result in and of itself. I had a great deal of data (20 years of it!) and it still wasn’t enough to make good predictions on whether or not a bill would become a law. Other data (perhaps the actual bill text?) might make the predictions better, but overall it’s just a difficult problem. There are a lot of moving parts (435 congresspeople and their personal staffs, committee staffs, the White House, lobbyists, foreign governments, not to mention all the voters in this country…), many of which act in ways that defy prediction.

While machine learning with the available data did not allow us to successfully predict Bill’s fate, it did allow us to observe what factors are most important to a bill’s success and shed light on what happens with the thousands of bills that come out of Congress every year.

So long, Bill, and best of luck on your uncertain path.
