I’m always looking for new datasets for ML projects, so I was particularly excited to discover this public domain dataset of ~400k congressional bills. The dataset has 20+ data points for each bill. Here’s a subset of this data for one bill:
- Title: A bill to provide for the expansion of the James Campbell National Wildlife Refuge Honolulu County Hawaii
- ID: 109-S-1165
- URL: https://www.congress.gov/bill/109th-congress/senate-bill/1165
- Topic: Public Lands
- Date Introduced: 6 June 2005
- Date Passed: 25 May 2006
- Congressperson who introduced it: Daniel Inouye
- Passed: Yes
The bills from this dataset were all manually assigned a topic by domain experts. The resolutions in the dataset are unlabeled, which makes it a great fit for AutoML Natural Language: we can use the labeled bills to train the model and see how it performs on unlabeled resolutions. We’ll use the title of the bill as the input to our model, and the label will be the topic. The original dataset includes a higher-level topic (the Major field in the dataset) and a more specific (Minor) topic for each bill. To keep things simple, we’ll use just the Major topic field to categorize bills.
Note that there are many possible input and label combinations you could use to build ML models from this data; we’re just using two fields in this example, the bill title and the Major topic. Here are some sample inputs and predictions for our model:
Title: To permit the televising of Supreme Court proceedings.
Category: Technology
-----
Title: A bill to provide a program of tax adjustment for small business and for persons engaged in small business.
Category: Domestic commerce
The first step in building a model with AutoML NL is uploading a CSV of training data. I wrote a script to extract only bills with a topic assigned and strip some characters from the text. The resulting CSV looks like this:
If you’d like to play with the dataset I used to train the model, I’ve made the CSV file publicly available here.
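For anyone who wants to build a similar CSV from the raw data, here is a minimal sketch of the kind of extraction script described above. It is not my original script: the column names (Title, Major), the set of characters it keeps, and the headerless two-column text,label layout are all assumptions you should adjust to the actual dataset and to what AutoML NL expects.

```python
import csv
import io
import re

# Hypothetical column names; the real dataset's headers may differ.
TITLE_COL = "Title"
TOPIC_COL = "Major"


def clean_title(title: str) -> str:
    """Replace characters outside a small whitelist and collapse whitespace."""
    title = re.sub(r"[^A-Za-z0-9 .,'-]", " ", title)
    return re.sub(r"\s+", " ", title).strip()


def labeled_rows(reader):
    """Yield (title, topic) pairs for rows that have both a title and a topic."""
    for row in reader:
        title = clean_title(row.get(TITLE_COL) or "")
        topic = (row.get(TOPIC_COL) or "").strip()
        if title and topic:
            yield title, topic


def convert(src, dst) -> int:
    """Read the raw CSV from src, write text,label rows to dst, return the row count."""
    writer = csv.writer(dst)
    count = 0
    for title, topic in labeled_rows(csv.DictReader(src)):
        writer.writerow([title, topic])
        count += 1
    return count


if __name__ == "__main__":
    # Tiny in-memory sample so the sketch runs without the real file.
    sample = io.StringIO(
        "Title,Major\n"
        '"To permit the televising of Supreme Court proceedings.",Technology\n'
        '"A resolution with no topic assigned",\n'
    )
    out = io.StringIO()
    print(convert(sample, out), "labeled rows written")
    print(out.getvalue(), end="")
```

To run it on real files, pass open file handles to convert() instead of the in-memory sample (open both with newline="" as the csv module recommends).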
Training an AutoML NL model
Now that we’ve got a CSV with our text inputs and their associated labels, we can upload this directly to the AutoML UI to create our dataset:
Once it uploads, we can look at all of our training data in the UI:
To train our model, all we need to do is press the train button:
How cool is that?! We don’t need to write any of the underlying model code — AutoML will handle that automagically for us.
Evaluating our model
To evaluate our model we’ll look at the confusion matrix in the Evaluate tab of the UI:
It may look confusing (it is called a confusion matrix after all), but it turns out it’s not so hard to understand: what we ideally want to see here is a strong diagonal running from the top left to the bottom right. Each cell on the diagonal tells us the percentage of text items for that topic in our test set that the model was able to classify correctly. Side note: AutoML automagically splits the data we upload into training, test, and validation sets.
The confusion matrix shows the accuracy for 10 of our 20 topics, but we can also look individually at a topic to see specific examples that our model classified correctly and incorrectly. Looking here, we might want to improve the training data for Domestic Commerce bills, since our model only classified 78% of those correctly, and confused about 10% of them with another topic.
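If you’d like to see what those percentages mean in code, here’s a small sketch of how a row-normalized confusion matrix can be computed from true and predicted labels. This is only an illustration of the idea, not how AutoML computes its evaluation, and the labels in the demo are toy values I made up.

```python
from collections import Counter, defaultdict


def confusion_matrix(true_labels, predicted_labels):
    """Map each true label to a Counter of the labels predicted for it."""
    matrix = defaultdict(Counter)
    for truth, predicted in zip(true_labels, predicted_labels):
        matrix[truth][predicted] += 1
    return matrix


def row_percentages(matrix):
    """Convert each row's counts into percentages of that row's total."""
    result = {}
    for truth, counts in matrix.items():
        total = sum(counts.values())
        result[truth] = {label: 100.0 * n / total for label, n in counts.items()}
    return result


if __name__ == "__main__":
    # Toy labels invented for illustration; not results from the model in this post.
    truth = ["Health", "Health", "Health", "Health", "Defense", "Defense"]
    preds = ["Health", "Health", "Health", "Defense", "Defense", "Defense"]
    for label, row in row_percentages(confusion_matrix(truth, preds)).items():
        print(label, row)
```

The value at result[topic][topic] is the diagonal cell for that topic; every other entry in the row shows which topics it gets confused with.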
Generating predictions on unlabeled bills
Next it’s time for the best part — generating predictions on unlabeled bills. We can try this out right in the AutoML NL UI. Let’s try the following unlabeled bill:
A concurrent resolution making the necessary arrangements for the inauguration of the President-elect and Vice President-elect of the United States.
The model says there’s a 95.6% chance this bill is related to Government Operations. I’ve run two more examples through our model to see how it performs on new data:
Bill: Honoring women who have served, and who are currently serving, as members of the Armed Forces and recognizing the recently expanded service opportunities available to female members of the Armed Forces.
Predicted label: Civil Rights 88.5%, Defense 7.5%
-----
Bill: Setting forth the congressional budget for the United States Government for fiscal year 2010 and including the appropriate budgetary levels for fiscal years 2009 and 2011 through 2014.
Predicted label: Macroeconomics 99.5%
We don’t have the ground truth label to compare our model’s predictions to since these are unlabeled, but the results generated look pretty accurate. If we want to build an app that auto-classifies new bills, we could do that with a simple AutoML API request to our trained model. Here’s an example using curl:
Be on the lookout for a post from one of my teammates that covers using this API to classify new bills.
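In the meantime, here’s a rough Python sketch of what building that prediction request and reading the response might look like. Treat the details as assumptions: the endpoint path, the textSnippet payload, and the displayName/classification.score response fields reflect my understanding of the v1beta1 REST API and should be checked against the current AutoML documentation. The project ID, model ID, and access token are placeholders, and the sketch only builds the request without sending it.

```python
import json
import urllib.request

# Assumed endpoint root for the AutoML v1beta1 REST API; verify before use.
API_ROOT = "https://automl.googleapis.com/v1beta1"


def build_predict_request(project_id, model_id, text, access_token,
                          location="us-central1"):
    """Build (but do not send) a POST request asking the model to classify text."""
    url = (f"{API_ROOT}/projects/{project_id}/locations/{location}"
           f"/models/{model_id}:predict")
    # Assumed request body shape for a plain-text classification request.
    body = {"payload": {"textSnippet": {"content": text,
                                        "mime_type": "text/plain"}}}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": f"Bearer {access_token}",
                 "Content-Type": "application/json"},
        method="POST",
    )


def ranked_labels(response):
    """Return (label, score) pairs from a parsed response, highest score first."""
    pairs = [(item["displayName"], item["classification"]["score"])
             for item in response.get("payload", [])]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Placeholder IDs and token; replace with real values before sending.
    request = build_predict_request("my-project", "my-model-id",
                                    "A bill to do something.", "ACCESS_TOKEN")
    print(request.get_method(), request.full_url)
    # Hypothetical response shape, using the scores from the example above.
    sample = {"payload": [
        {"displayName": "Defense", "classification": {"score": 0.075}},
        {"displayName": "Civil Rights", "classification": {"score": 0.885}},
    ]}
    print(ranked_labels(sample))
```

To actually send the request, you would pass it to urllib.request.urlopen and parse the JSON body before handing it to ranked_labels.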
For the American readers here: since Election Day is coming up in the US and this post covered a political dataset, I’ll use this opportunity as a shameless plug to encourage you to vote :)
Ok, back to my regularly scheduled programming — check out these resources to learn more about what I covered here:
And if writing model code is your thing, check out this post I did on building a text classification model with Keras to predict the price of wine given its description, or this one on using TF Hub to build a text classification model. Stay tuned for more blog posts on this dataset. I’d also love to hear if you do anything interesting with it! You can leave a comment below or find me on Twitter at @SRobTweets.