A Guide to Investigating the Datafied State Through Documents

Gleaning knowledge from documents involves more than just reading the words on a page.

Jenna Burrell
Data & Society: Points
Mar 9, 2023


If we want to study power, and the tactics and practices of elites and others who hold it — in other words, if we want to study “up” — perhaps a good place to start is in the archives. As Tamara Nopper has pointed out, sociologists and other researchers who study inequality tend to study those without much power, with the goal of documenting and understanding their experiences and points of view. Yet research is another word for scrutiny, and Tamara’s point is that this scrutiny has been lopsided. As a way to redirect attention to who holds power, and to understand how they wield it, it is worth our time to treat document analysis as a method. Where research calls the accounts of those in power into question, it helps to come with receipts.

Document analysis as a method

In my many years of teaching qualitative research methods at UC-Berkeley’s School of Information, I worked with many students who were doing fieldwork in materially-rich environments, including places where work and life revolve around documents. But I’ve never found an adequate guide to document analysis that clearly specifies a method. This account is meant as a step toward remedying that.

Document analysis is not the same thing as text analysis. The latter has been well-documented as a method, but typically directs a researcher’s attention narrowly, to the words on the page. Content analysis, a common form of quantitative text analysis, highlights word frequencies. While this method offers concrete and precise evidence, it is sharply limited. More theoretically robust forms of text analysis — semiotics, discourse analysis — tend to neglect materiality. They attend to words and their meaning, but abstract away the document itself, ignoring things like paper quality, how it is bundled, where it has traveled, what its creation, circulation, or storage enacts (or fails to enact) in the world. Such standardized methods also treat meaning-making in a general way. They do not always help us think about the specialized and surprising ways that words are defined within professional and bureaucratic cultures.
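To make the contrast concrete, here is a minimal sketch of the word counting that content analysis relies on. The function name, the sample sentence, and the stopword list are illustrative assumptions, not drawn from any particular study or tool. Note what the sketch cannot see: who wrote the words, what the document does, or where it has traveled.

```python
import re
from collections import Counter


def word_frequencies(text, stopwords=frozenset()):
    """Count how often each word appears, ignoring case and punctuation."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stopwords)


# Made-up sample sentence, not taken from any real document.
sample = "The vendor shall retain the data. The city may audit the vendor."

counts = word_frequencies(sample, stopwords={"the"})

# Print the two most frequent remaining words with their counts.
print(counts.most_common(2))
```

The counts are precise, but they say nothing about what "retain" or "audit" means within a given bureaucratic culture, which is the limitation described above.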

As Tamara recounted, Data & Society’s summer series on methods of document analysis for studying the datafied state examined several genres of documentation that emerge in the United States government, from the federal to municipal levels — patent applications, lobbying transparency reports, legislative documents, requests for proposals, city contracts with tech firms — all the things that fill the “bureau” in a bureaucracy. Documents play a critical role in exerting administrative control, as Matthew Hull notes in his review article on the role of documents in bureaucracies. They don’t only represent what is happening in government, they enact it. Long before computer automation came into government use, the traditional materials of bureaucracy were often a stand-in for human decision-makers. Rules, forms, and processes made clerks more interchangeable and minimized human discretion while taking the pressure off of government workers to justify their decisions.

What can and what can’t we learn from government documents?

From those invented specifically to pull back the curtain on government, like lobbying transparency reports, to the mundane and diverse minutiae surfaced by Freedom of Information Act requests, to the transgressive lode of classified documents in the WikiLeaks archive, documents have become a potent symbol of government transparency. Yet these documents can easily deceive. Government documents emerge from arcane bureaucratic processes and are filled with specialized language, rendering them opaque. What seem like common-sense words can have altogether different meanings. Sometimes the conclusions that can be drawn from them seem obvious, but are altogether wrong. Pondering why documents are overlooked in ethnographic research, Matthew Hull suggests they are easy to see as “giving immediate access to what they document.” In other words, they may be treated as pure and self-evident representation.

Take a particular type of document, the US patent application. Its purpose is to protect the filer’s invention by securing exclusive rights to commercialize it. An easy conclusion to draw from this is that a patent represents an invention that is feasible, that has actually been built, and that is actively being commercialized. Yet the practice of filing a patent application reveals a more complicated dynamic. There is often a great deal of back-and-forth work between the patent filer and the patent office to shape something narrow enough to be granted a patent; many patent applications are little more than a first draft of an idea that goes no further. Also invisible is the lucrative business of patent portfolio building and the licensing that corporate tech firms engage in with one another. In fact, Big Tech portfolios are filled with unimplemented ideas that firms have no intention of building.

But it’s not only what is misleadingly present, but what is invisibly absent, that challenges researchers. Our imaginations can fail us as we consider only what confronts us on the page. Redacted documents remind us more bluntly of the exceptions that allow government officials to withhold information. Some documents seem to act out transparency without fully delivering it. And many documents can be difficult to hunt down in the first place.

In the United States, the Freedom of Information Act (FOIA) request has become a standard tool for journalists and researchers probing into the workings of government. While FOIA plays an important role, it is not a cure-all for understanding what happens in government. For one thing, it applies only to federal government documents. State and municipal governments have different barriers to access: they operate by different rules of transparency, public records are unevenly digitized, and there are often fees to acquire copies. It is often still necessary to visit a public records office in person in order to request documents. On the other hand, many, many government documents (legislative records, city council meeting agendas, etc.) can now be readily acquired online, without going through human intermediaries.


Documents as part of a social world

It is helpful to recognize documents as a distinct type of material artifact. Drilling down into the many types of government documents, we find that each has its own peculiarities. This is one key reason why the work of interpreting documents has evaded a structured methodology. A school intake form, a municipal contract, a privacy impact assessment — each type of document has little in common with the others, and there are few common standards or structures that carry across them. As a result, each document genre must be scrutinized, learned, and mastered on its own. Still, there are some structured questions we can ask of any and all documents to help us find a way in:

  • Is there a listed author or authors? Who are they, and what is their role? Is the document formalized by a signature? What does the signature formalize or certify? Who commits a signature to the document, and why that person?
  • Is the document a (legally) required one? What story or event precipitated this requirement? What are the requirements surrounding its production? How have practices of document creation adapted to the requirement, or how might they have drifted away from the document’s original intentions over time?
  • What are the document’s key sections? What is consistent across examples within this document genre? What varies?
  • Once finalized, what does the document complete or effect?
  • After the document is produced, is it ever consulted? Who consults it? Under what circumstances do they consult it? What consequences follow from consulting the document?
  • How does the intended purpose of the document depart from the ways it is used?

These questions reflect the understanding that documents are part of a social world, and specifically, are part of work practices. Being able to make precise claims to knowledge gleaned from documents — particularly claims about how government works — requires more than just reading the words on a page. Without knowing when and where they are downloaded, handled, or invoked in the world, reading documents as texts leads to a flattened understanding of what they can do.
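One way to keep these questions in view while working through a pile of documents is a simple coding sheet, filled in once per document. The sketch below is only an illustration under our own assumptions: the class name, the field names, and the example entry are hypothetical, and the fields loosely paraphrase the questions listed above.

```python
from dataclasses import dataclass, fields


@dataclass
class DocumentCodingSheet:
    """Free-text notes for one document, one field per guiding question."""
    genre: str
    authors_and_roles: str = ""
    signature: str = ""
    legal_requirement: str = ""
    key_sections: str = ""
    what_it_effects: str = ""
    who_consults_it: str = ""
    intended_vs_actual_use: str = ""


def unanswered(sheet):
    """Return the names of fields that are still blank."""
    return [f.name for f in fields(sheet) if not getattr(sheet, f.name).strip()]


# Hypothetical example entry, not based on a real contract.
sheet = DocumentCodingSheet(
    genre="municipal contract",
    signature="Signed by a city official (hypothetical example)",
)

open_questions = unanswered(sheet)
print(open_questions)
```

The list of blank fields is a reminder of what cannot be answered from the page alone. Questions about who consults a document and how it is actually used usually require interviews or observation rather than closer reading.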

In her study of how and where risk scores are used in court cases, Stanford Professor and Data & Society affiliate Angèle Christin makes this point clearly. In the criminal legal system, risk scores attached to individual defendants have influenced decisions about parole release and about bail or bond amounts. Previous work that evaluated risk scoring algorithms had considered their computational patterns, finding forms of racial bias in how these predictions erred. Christin’s approach, by contrast, attends not to the scores by themselves, but to computer interfaces, documents, and even algorithms as things that exist materially in the world. She examines how risk scores are created, deployed, and referenced within the work practices of probation officers, judges, defense attorneys, prosecutors, and clerks. In her ethnography of courtrooms in an urban area of a southern US state that she refers to as Marcy County, Christin observed how risk scores were stacked toward the bottom of a huge pile of papers judges receive about a case, and how they often went unmentioned in court proceedings. How did judges actually encounter these scores? Did they even look at them? By asking these questions, Christin underscores how, when attention is focused on documents themselves rather than how they are used, broader consequential context is lost.

A roundup of document genres

As a starting point for others interested in studying the Datafied State, we offer this list of document types, pointers about where they can be located, and some hints about how to read and interpret them. We aim to direct readers toward a better understanding of what knowledge claims can and cannot be made from these documents.

While these first two document genres provide little direct insight into government, they demonstrate how transparency requirements, enforced by the government, can create visibility into the private sector and specifically into tech industry practices.

Patents

The power of the government to grant patents and copyrights was written into the US Constitution. Patents filed with or granted by the US Patent and Trademark Office (USPTO) can be easily accessed in two places: through Google patent search (for a nice, clean, fast user interface), or through the USPTO search site. Patents are granted only to human “inventors,” but the “assignee” is often a company. It can be tempting to see patents as a way to get a glimpse inside of tech companies, but as noted above, they are easy to misinterpret. Many large tech companies patent ideas they never intend to implement, with the hope of then selling or licensing those patents.

SEC filings

In the wake of the stock market crash of 1929, the Securities Act of 1933 and the Securities Exchange Act of 1934 established the disclosure regime under which companies that sell stock to the public file periodic financial reports, including quarterly and annual reports, with the SEC. The purpose of these filings is to help investors make investment decisions. For researchers, however, they are illuminating documents for understanding what a company thinks of as its core product and value add, and how it talks about its products and services. For this purpose, they are far more useful documents than patents. They also provide a place to find employment figures and (for social media platforms) usage measurements (like daily active users). Quarterly and annual filings provide a way to track changes over time and do historical investigations. SEC filings are typically kept organized on company websites on an investor relations page. If you cannot find the documents on a company website, you can use EDGAR, the SEC search page.

The following document genres illuminate more directly what is happening within government.

Public records

This term is a catch-all for any kind of documentation of government conduct that is not confidential. The public records request is a well-established practice for seeking and obtaining such documents, with a long tradition of use by researchers, journalists, and activists. The FOIA (Freedom of Information Act) request is a type of public records request limited to federal government agencies. You can launch a FOIA for many federal agencies here. A public records request can yield hundreds of pages spanning many different types and genres of document, which you must typically sort through and scrutinize manually. Many states, counties, and cities have their own public records pages. In many cases you still have to physically walk into a public records office, fill out a form, and pay some money to get copies of the documents you seek. From the Benefits Tech Advocacy Hub, here’s a great guide to making public records requests.

Lobbying transparency reports

The outcome of the 1995 Lobbying Disclosure Act signed into law by President Bill Clinton, these are a relatively recent form of US federal government documentation. Lobbying transparency reports are submitted quarterly and detail every meeting held between members of government and lobbyists, who those lobbyists represent (private firms or organizations), how much money they spend on lobbying, and typically include the issues and concerns they are advocating for. The raw reports can be accessed through a keyword search. Their highly structured format also makes it possible to analyze and visualize patterns in them computationally.
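As a rough illustration of what computational analysis of such structured reports might look like, here is a minimal sketch that totals reported spending by client and by quarter. Everything in it is an assumption made for the example: the field names (client, quarter, amount), the organizations, and the dollar figures are invented, and the real reports use their own schema that you would need to map onto something like this.

```python
from collections import defaultdict

# Hypothetical records with made-up field names and values.
# These do not reflect the actual report format or any real filing.
reports = [
    {"client": "Example Corp", "quarter": "2022-Q1", "amount": 50000},
    {"client": "Example Corp", "quarter": "2022-Q2", "amount": 70000},
    {"client": "Sample Org", "quarter": "2022-Q1", "amount": 20000},
]


def total_by(records, key):
    """Sum the reported amount for each distinct value of `key`."""
    totals = defaultdict(int)
    for record in records:
        totals[record[key]] += record["amount"]
    return dict(totals)


by_client = total_by(reports, "client")
by_quarter = total_by(reports, "quarter")

print(by_client)
print(by_quarter)
```

Totals like these can point to where to look, but interpreting them still depends on knowing how the reports are produced and what filers are actually required to disclose.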

Legislative records

For comprehensive records of introduced federal legislation in the United States (whether passed or not), see Congress.gov. The history of revisions and amendments and the journey of a piece of legislation through committees are all available, and these can be useful for interpretive and contextual readings. Congressional voting data is transformed into useful visualizations at Voteview. To access, learn about, and compare legislation at the state level, Open States is a useful resource, as is the National Conference of State Legislatures.

Municipal contracts

Given the increasing reliance by governments on public-private partnerships, understanding the Datafied State means studying documents such as contracts or requests for proposals (RFPs), which provide a record of the procurement of tech tools and data systems. Municipal contracts between city governments and, for example, tech companies, are often returned from a public records request. Repositories are at the city level and vary considerably: check city websites and city council meeting archives. Municipal contracts with tech firms have proved useful in efforts to understand surveillance and privacy issues for citizens, examine policing practices, and more.

(Privacy) impact assessments

Joining financial impact assessments and environmental impact assessments, these are among the more recently required types of impact assessments. While we’ve heard the warning that required assessments can become “box checking exercises,” and recognize that the significance of any impact assessment is rooted in how it mobilizes different actions, for research purposes these documents are (like the SEC filings described above) useful places to learn more about the kinds of tools governments are procuring and for what purposes. See, for example, the list of recent PIAs of the US federal government’s General Services Administration.

Resources

Here are a few great resources that provide background or “how to” instruction on government document analysis and model the method.

  • Lilly Irani and Jesse Marx (2021), Redacted

Redactions are the black bars that appear over sections of text in documents otherwise released to the public. This appealing risograph printed booklet includes samples of many amusing and mysterious redacted documents alongside commentary and critique. An interview with a “redactor” pulls back the curtain on the practice and logic of withholding information from the public.

A clear “how to” and a call to action.

This review article is the best place to get the big picture view on how to think about documents from a scholarly perspective.

A huge thank you to Kiyo Kubo, Alex Garlick, Kenyon Farrow, Merrick Schaefer, Zoe Nemerever, Brian Hofer, Robyn Caplan, Deirdre Mulligan, and Bob Gellman, our guest instructors for this methods series. Our methods focus was decidedly anchored within the United States, and aligned with new efforts at D&S to work with US federal agencies. Our upcoming Keywords on the Datafied State collection is shaping up to take a more global view, looking both from within the State and outside of it. Stay tuned for more on that soon.


Jenna Burrell
Data & Society: Points

Director of Research, Data & Society. Professor, School of Information, UC-Berkeley