How a complex network of bills becomes a law: Introducing a new data analysis of text incorporation!

Provisions from 15 separate bills were merged into two bills that Congress ultimately enacted.

A new analysis we incorporated into GovTrack late last year reveals when provisions of bills are incorporated into other bills. Our new tool will reveal much more about what Congress is doing, and what laws are being made, than has ever been known to the general public.

The story we’ve been telling ourselves.

Ever since recent Congressional gridlock began around 2010, there has been keen interest in measuring it by counting up the number of bills that Congress has enacted year after year and looking for a nose dive:

Total number of bills enacted and the total number of words in those bills.

We spun this story in 2014. Our colleagues at Democracy Fund wrote about it last month:

Others have used these counts of enacted bills as horrifyingly misleading measures of legislator effectiveness.

The simplified story that we keep telling ourselves, that a bill becomes a law, has kept us from seeing the forest for the trees.

This is what we were taught in school. A bill goes through about 6 stages in becoming a law:

In the wonderful “How Our Laws Are Made” infographic by Mike Wirth and Suzanne Cooper-Guasco in 2010, the steps from bill to law are expanded from 6 to 13:

By Mike Wirth and Suzanne Cooper-Guasco, 2010.

From 13 steps you could go to this official flow chart with well over 20 steps or this 65-page official explanation. But we’re still missing the big picture:

Consider “re-introductions.” At left is GovTrack’s history of Rep. Eleanor Norton (DC)’s New Columbia Admission Act.

We show that the bill was first introduced in 1991, had a vote in 1993, and was re-introduced three more times prior to the current version of the bill in the 114th Congress.

But there is still much more to the story of legislating.

How a complex network of bills becomes a law.

All too often Congress cuts bills apart and pastes them back together — sometimes into an “omnibus.” The bills that finally get a vote are an amalgam of provisions from other bills that either can’t or won’t get a standalone vote themselves.

The most important legislation is crafted this way.

The diagram at the very top of this post maps out the ten sources of the provisions that finally made it into the 21st Century Cures Act, a landmark law related to drug research enacted late last year. The diagram below maps out 16 bills that were cut apart and pasted back together to make 8 new laws last year about reducing government waste and abuse:

The bills circled in black were not enacted but had significant parts of their text in common with a bill that was enacted, indicated by a red circle. The number shown by each arrow is the percentage of text in the un-enacted (black) bill that occurs in the enacted (red) bill. (Some of the numbers are small, even close to zero, but those are cases where the bill is quite long and a small percentage still represents a whole provision.)

This isn’t just a matter of discovery. It is a window into how Congress really works, the processes that only insiders are normally able to see. Daniel Schuman of Demand Progress pointed out to me during the drafting of this post:

In other words, seeing the network of bills is crucial for understanding how the minority political party in Congress (currently the Democrats) is able to work with the majority party (currently the Republicans) and achieve legislative goals. Without looking at the network, Congress may appear far more partisan than it really is. And without knowing how Congress works, outsiders can’t hope to be effective participants in our own government.

Only about 3% of bills will be enacted through the signature of the President or a veto override. Another 1% are identical to those bills, so-called “companion bills,” which are easily identified (see CRS, below). Our new analysis reveals almost another 3% of bills which had substantial parts incorporated into an enacted bill in 2015–2016. To miss that last 3% is to be practically 100% wrong about how many bills are being enacted by Congress.

And there may be even more than that, which we’ll find out as we tweak our methodology in the future.

There are so many new questions to answer:

  • Who are the sources of these enacted provisions?
  • How often is this cut-and-paste process cross-partisan?
  • What provisions were removed from a bill to be enacted?
  • Is cut-and-paste more frequent today than in the past?

(H/t to Daniel for suggesting some of these questions.)

What to look for on GovTrack

We’ll show two new statuses for bills when their provisions have been incorporated into enacted bills.

The first is “Enacted Via Other Measures,” which we’ll show when a bill has at least about 33% of its provisions incorporated into one or more enacted bills, and we’ll link to the bills its provisions were incorporated into:

We’ll say “Parts Incorporated Into Other Measures” when a bill has some of its text, but less than about 33%, in common with enacted bills.
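As a rough illustration of the rule behind the two statuses, here is a minimal Python sketch. The function name and the exact 0.33 cut-off are assumptions made for the example (the real threshold is only "about 33%"), and the sketch assumes the overlap has already been judged significant by the analysis described later in this post.

```python
def incorporation_status(fraction_incorporated, threshold=0.33):
    """Pick a status label from the share of a bill's provisions
    found in enacted bills.

    `threshold` is an illustrative stand-in for the "about 33%" cut-off.
    """
    if fraction_incorporated <= 0:
        # Nothing in common with an enacted bill: no new status.
        return None
    if fraction_incorporated >= threshold:
        return "Enacted Via Other Measures"
    return "Parts Incorporated Into Other Measures"


print(incorporation_status(0.60))
print(incorporation_status(0.10))
```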

We’ll also show the same information, but from the opposite perspective, on the pages for enacted bills. When an enacted bill has text in common with other bills, we’ll show all of those other bills in a new section called “Incorporated Legislation”:

In both cases, “compare text” links take you to a comparison of the text of the two bills (a feature we’ve long had).

Our advanced search has been updated to allow you to find all bills that we consider enacted, either the real way (signed by the President, etc.) or by what we now call “enacted via other measures” (see above). Turn on the Enacted — Including by Incorporation into Other Bills filter to see them all.

This is particularly useful when you want to see what laws a particular Member of Congress had a leading role in getting enacted. You’ll want to count not only the enacted bills that had that Member’s name on them but also ones that weren’t enacted but had significant provisions moved into bills that were. The Enacted — Including by Incorporation into Other Bills filter does just that.

(This filter replaces an earlier filter called “Enacted — Including via Companion Bills” which performed the same function but only included bills identified by the CRS as totally identical (see below). The new functionality goes beyond identical.)

Now that we can identify all of the bills that were “enacted” by a legislator, including via text incorporation, we are able to show that on our legislator pages. Here’s the new Enacted Legislation section on the page for Sen. Steve Daines, showing a bill that had a major component incorporated into two other bills that were enacted:

The “laws enacted” part of our legislator report cards, which show key legislative statistics for each legislator, now includes bills enacted via text incorporation rather than just bills enacted the usual way.


I’ll cover the methodology in detail next, but first let me list some of the limitations of what we’re doing:

#1. This process is automatic and so, of course, imperfect. There will be cases where the algorithm that identifies incorporated text gets it wrong, either because it sees some text in common between two bills that isn’t meaningful (like just the words “is amended by striking”) or because it misses some provisions that actually are the same but weren’t exactly the same or weren’t long enough to be counted.

#2. The analysis is arbitrary. The algorithm uses arbitrary cut-offs that range from 97% down to 15% depending on the length of bills to determine if enough text is in common between two bills to say the provisions are the same. There is no right answer here. There is no simple definition of “a provision” and no perfect way to identify one by algorithm. If we revise those cut-offs in the future, GovTrack will show a somewhat different set of relationships between bills.

#3. We’re only looking at provisions that move into enacted bills. At a later date we may look at provisions that are similar across bills prior to enactment.


The analysis of this new web-of-bills was easy in comparison to a lot of the work that has come before it.

The information technology staff within the House of Representatives and the Senate created the foundation for this work over the last 25 years by providing the public with access to good, clean, comprehensive, and structured data about legislation in Congress. The new XML bill status data that went online earlier this year, which I wrote about already, plus the text of legislation in XML format, which came online around 2010 but only became complete this year, form the cornerstone of our new analysis.

Our text incorporation analysis begins with the official text of legislation in XML (here’s how to get it), which looks like this:

The official XML is then simplified by removing headings, numbering, effective dates, and boilerplate (because they often change when provisions are moved from bill to bill) and then flattened to plain text, with some additional Unicode normalizations applied. The same text in the XML above becomes the plain text here:
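The simplification step might look roughly like the Python sketch below. It is not the code GovTrack runs: the tag names in `SKIPPED_TAGS` are assumptions about how numbering and headings are marked up, the sample bill is made up, and effective dates and boilerplate are not handled at all.

```python
import re
import unicodedata
import xml.etree.ElementTree as ET

# Assumed tag names for numbering and headings; the real filter list
# is likely longer and may differ.
SKIPPED_TAGS = {"enum", "header"}


def flatten(element):
    """Yield the text fragments inside `element`, skipping SKIPPED_TAGS."""
    if element.tag in SKIPPED_TAGS:
        # Drop the element's own content but keep the text that follows it.
        if element.tail:
            yield element.tail
        return
    if element.text:
        yield element.text
    for child in element:
        yield from flatten(child)
    if element.tail:
        yield element.tail


def bill_xml_to_text(xml_string):
    """Flatten bill XML to plain text with normalized Unicode and spacing."""
    root = ET.fromstring(xml_string)
    text = " ".join(flatten(root))
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


sample = (
    "<bill><section><enum>1.</enum><header>Short title</header>"
    "<text>This Act may be\n cited as the Example Act.</text></section></bill>"
)
print(bill_xml_to_text(sample))
```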

That text is compared with text similarly extracted from other bills. The comparison between a pair of bills is performed with Python’s built-in difflib.SequenceMatcher class (with some tricks to reduce noise), which computes the blocks of text in each bill in the pair that they have in common.
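Here is a minimal sketch of that comparison using `difflib.SequenceMatcher`. The `min_length` filter is a made-up stand-in for the noise-reduction tricks, which aren't spelled out here, and the two example texts are invented.

```python
from difflib import SequenceMatcher


def common_blocks(text_a, text_b, min_length=20):
    """Return the stretches of text that both bills share.

    `min_length` is a hypothetical noise filter: very short matches
    (a stray word or two) are dropped.
    """
    matcher = SequenceMatcher(None, text_a, text_b, autojunk=False)
    return [
        text_a[match.a : match.a + match.size]
        for match in matcher.get_matching_blocks()
        if match.size >= min_length
    ]


bill_a = "Sec. 1. The Secretary shall submit a report to Congress each year."
bill_b = (
    "Preamble. The Secretary shall submit a report to Congress each year. "
    "Other text."
)
for block in common_blocks(bill_a, bill_b):
    print(repr(block))
```

Passing `autojunk=False` turns off difflib's heuristic that ignores very frequent elements in longer sequences, which seems like the safer choice when comparing character by character.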

We only look at how much text the bills have in common, as a ratio of the length of the common text (in characters) divided by the total length of each bill (in characters), giving one ratio for each bill. When both ratios are very high, the bills are nearly identical. When one ratio is high but the other is low, then parts of the first were incorporated into the second. When the two ratios are both low, there is no substantial text in common.
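To make the two ratios concrete, here is a small sketch. The single `high=0.5` cut-off is purely illustrative, since the real cut-offs vary with bill length, and the example texts are invented.

```python
from difflib import SequenceMatcher


def overlap_ratios(text_a, text_b):
    """Return (common / len(a), common / len(b)), measured in characters."""
    if not text_a or not text_b:
        return 0.0, 0.0
    matcher = SequenceMatcher(None, text_a, text_b, autojunk=False)
    common = sum(match.size for match in matcher.get_matching_blocks())
    return common / len(text_a), common / len(text_b)


def describe(ratio_a, ratio_b, high=0.5):
    """Interpret a pair of ratios; `high` is an illustrative cut-off."""
    a_high, b_high = ratio_a >= high, ratio_b >= high
    if a_high and b_high:
        return "nearly identical"
    if a_high:
        return "first incorporated into second"
    if b_high:
        return "second incorporated into first"
    return "no substantial text in common"


provision = "The Secretary shall submit a report to Congress each year."
long_bill = "Z" * 200 + provision + "Q" * 200  # filler stands in for other text
print(describe(*overlap_ratios(provision, long_bill)))
```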

The only question left is how high is high enough? A 1% text overlap might just be a few insignificant words like “is amended by striking.” Not all common text is important. I’ve chosen some arbitrary thresholds with cut-offs that range from 97% down to 15% depending on the length of the bills, and these cut-offs are producing good results so far. Lowering these cut-offs would cause us to identify more incorporated text — at the risk that some of that text is actually insignificant (as in the “is amended by striking” example).

Since there are about 10,000 bills introduced each two-year Congress, there are too many to compare every bill to every other bill. In the 114th Congress, it would take 258,469,929 comparisons to identify the complete web of text incorporation — that would take a very long time.
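For a sense of scale, comparing every one of n bills with every other takes n(n-1)/2 comparisons if each pair is checked once. The figure quoted above equals 16,077 squared, which would fit a count of every ordered pairing among 16,077 measures. That reading is inferred from the arithmetic alone. The function names below are just for illustration.

```python
def unordered_pairs(n):
    """Distinct pairs among n bills, each pair counted once."""
    return n * (n - 1) // 2


def ordered_pairs_with_self(n):
    """Every (a, b) combination, including a bill paired with itself."""
    return n * n


QUOTED = 258_469_929

print(unordered_pairs(10_000))
print(ordered_pairs_with_self(16_077) == QUOTED)
```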

To reduce the problem space, I only compared each enacted bill (of which there are about 300 per two-year Congress) to roughly the top 50 other bills that have overall bag-of-words text similarity (computed by the Solr MoreLikeThis query, since GovTrack already has a Solr server running). That picks out about 3,000 pairs of bills to compare, rather than 250 million, and the 3,000 comparisons take about 10 minutes to compute.
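The idea of shortlisting candidates before running the expensive comparison can be sketched without Solr. In the sketch below a plain cosine similarity over word counts stands in for the MoreLikeThis query, which scores documents differently. The function names, bill ids, and texts are all invented for the example.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercase words, ignoring order and punctuation."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(bag_a, bag_b):
    dot = sum(count * bag_b[word] for word, count in bag_a.items())
    norm_a = math.sqrt(sum(c * c for c in bag_a.values()))
    norm_b = math.sqrt(sum(c * c for c in bag_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def candidate_pairs(enacted, others, top_n=50):
    """Pair each enacted bill with its `top_n` most similar other bills.

    `enacted` and `others` map a bill id to its plain text.
    """
    other_bags = {bill_id: bag_of_words(text) for bill_id, text in others.items()}
    pairs = []
    for enacted_id, text in enacted.items():
        bag = bag_of_words(text)
        ranked = sorted(
            (bill_id for bill_id in other_bags if bill_id != enacted_id),
            key=lambda bill_id: cosine_similarity(bag, other_bags[bill_id]),
            reverse=True,
        )
        pairs.extend((enacted_id, other_id) for other_id in ranked[:top_n])
    return pairs


enacted = {"E1": "drug research funding for the national institutes of health"}
others = {
    "A": "funding for drug research at the national institutes of health",
    "B": "naming a post office in springfield",
    "C": "drug research grants",
}
print(candidate_pairs(enacted, others, top_n=2))
```

Only the shortlisted pairs then go through the slower character-level comparison.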

The source code for the analysis is posted on github.

(The diagrams in this blog post were created using graphviz.)
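For anyone who wants to draw a similar diagram, a sketch along these lines would produce Graphviz DOT text, with enacted bills outlined in red, other bills in black, and each arrow labeled with a percentage. The function name, bill ids, and percentages are placeholders, and this is not the script used for the diagrams above.

```python
def to_dot(enacted, edges):
    """Build DOT source for a text-incorporation diagram.

    `enacted` is a set of bill ids; `edges` is a list of
    (source_bill, enacted_bill, percent_of_source_text) tuples.
    """
    lines = ["digraph incorporation {"]
    nodes = sorted({node for source, target, _ in edges for node in (source, target)})
    for node in nodes:
        color = "red" if node in enacted else "black"
        lines.append(f'  "{node}" [shape=ellipse, color={color}];')
    for source, target, percent in edges:
        lines.append(f'  "{source}" -> "{target}" [label="{percent}%"];')
    lines.append("}")
    return "\n".join(lines)


dot_source = to_dot(
    enacted={"law-B"},
    edges=[("bill-A", "law-B", 85), ("bill-C", "law-B", 12)],
)
print(dot_source)
```

Saving the output to a file and running it through Graphviz's `dot` command (for example `dot -Tpng`) renders the picture.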

Related Work

I want to mention some related analytical work.

The first group to create a large network mapping of bills in Congress was likely the Congressional Research Service (CRS). CRS has been marking bills as “identical” or “related” for several decades, through a process that is probably completely human driven. GovTrack has long displayed this CRS information on bill pages and has used the CRS bill relations to help users navigate between bills to find the one that is most relevant.

A graph of citations between sections of the U.S. Code, with only sections affected by PPACA shown. Li, W., Azar, P., Larochelle, D., Hill, P., Lo, A.W. Law Is Code: A Software Engineering Approach to Analyzing the United States Code. Journal of Business and Technology Law, 10, 297, 2015.

Our analysis is conceptually similar to the work by William Li on tracing back the law to its origins — starting with Supreme Court opinions, public comments on regulations, and the U.S. Code. One of his diagrams of the U.S. Code is shown at left.

Li’s 2016 dissertation and Harlan Yu’s 2012 dissertation are indispensable references on the history and structure of the U.S. Code.

And PredictGov, which currently provides GovTrack with our bill prognosis scores, uses a bill similarity analysis to predict not just whether a whole bill will be enacted but whether individual provisions of a bill are likely to be enacted. You can find that inside the “details” link next to our bill prognosis predictions:

PredictGov prognosis of a bill’s individual provisions.

This post was written by GovTrack founder Joshua Tauberer. Thanks to Daniel Schuman and other colleagues for discussion prior to publication, and to Congressional staff over the years who have made public access to legislative information a reality.

