If you want to know more about how academics are using Wikidata in network analysis research, here is our full Q&A with the University of Colorado.

We covered the University of Colorado’s project in a short blog post today, but feel that the full detail might be useful to people who want to know more about this interesting research.

Who was involved in this project from the University of Colorado?

  • Brian Keegan: Assistant Professor, Information Science. Primary investigator on this project.
  • E. Scott Adler: Professor, Political Science.
  • Thea Lindquist: Professor, Libraries, University of Colorado Boulder.
  • Joe Zamadics: PhD Candidate, Department of Political Science.

How did the concept come about?

Brian had done some exploratory research as a post-doctoral research fellow around parsing “long data” from biographies of members of long-lived institutions like the Catholic Church and U.S. Congress. Scott brings in expertise around studying changes in legislative behaviour over time. Thea brings in expertise around designing biographical ontologies for digital humanities projects and also introduced the rest of the team to the term “prosopography.”

The Library of Congress publishes and regularly updates the biographies of every member of Congress going back to the first Congress in 1789. These biographical data have generally not made it into the data sets that political scientists use to study legislative behaviour.

We hypothesized that the biographical backgrounds of legislators could play an important role in legislative behaviours. By combining these two data sources, we can ask interesting questions like, “Do Ivy League graduates form cliques?” or “Are medical doctors more likely to break with their party on votes concerning public health?”

Why is there a gap between the 81st and the 104th Congress? Was it the availability of data, or something else?

The problem isn’t missing data. Rather, things get hairier the farther back in time you go: defunct countries, renamed cities, archaic professions, etc.

The 81st Congress was among the first Congresses to start after World War II, but it still captures some of the idiosyncrasies of this older historical data, since its members were born in the late 19th century.

The data entry tasks were time-intensive enough that we knew we couldn’t get through all the post-war Congresses. We therefore wanted to sample the difficulty of these historical Congresses while prioritising data for a complete era: in this case, the post-Gingrich Congresses from 1995 onwards.

What are you hoping to achieve?

We have several goals. One, we would like to create a dataset widely used by those in the field of network analysis. Real-world social networks contain multiple kinds of relationships: communication, trust, shared background, etc. Theories and methods for analysing how these overlapping relationships interact and change are only starting to become mature. Networks of members of Congress containing multiple kinds of relationships changing over time could become a model dataset for network scientists.

Two, we would like to answer interesting political science questions that have remained unaddressed because this biographical information has yet to be put in a usable format. The biographies of Congressional legislators capture important information that could influence the committees they join, the bills they co-sponsor, and the votes they cast. Adding in biographical data could reveal new dimensions of legislative behaviour alongside classic dimensions like party, ideology, and seniority.

Three, we would like to use this project to demonstrate the power of semantic web data and Wikidata to other social scientists. Many datasets are not public, or lack the information needed to combine them with other data. Leveraging a public and robust semantic web technology like Wikidata can promote greater transparency, standardization, and collaboration for quantitative social scientists.

How many students did you work with, over what timeframe?

The lead research assistant on the project is Joe Zamadics, a graduate student in political science. He managed a team of ten undergraduates in the previous academic year.

The project has been ongoing for one year. We plan to continue the project by updating more members’ Wikidata pages and by demonstrating the power of Wikidata through research projects.

We are roughly 15% of the way through all members of Congress dating back to the first Congress in 1789. Our initial grant is winding down and we are exploring other options for funding this research and scaling data entry for the remaining biographies.

How is the project coming along? And how much data has been added?

Our project has revised over 1,500 Wikidata items about members of Congress. We have covered the 104th to the 115th Congresses (1995–2018). We also have the 80th and 81st Congresses (1947–1951).

This sampling strategy let us explore the data entry challenges for legislators in two different eras. The information we captured ranges from birth locations to military service to education to occupation. We have many other interesting connections such as which members interned for previous members and which members have other family ties in Congress.

Have you started to run any SPARQL queries yet?

We started running SPARQL queries this summer. We are still experimenting with extracting data from the information we entered over the past year.

One example we tried was looking at House member ideology by occupation. The graph below shows the ideology of members from three occupations: athletes, farmers, and teachers (roughly 130 members in all).

The x-axis shows common ideology (liberal to conservative) and the y-axis shows members’ ideology on non-left/right issues such as civil rights and foreign policy. The graph shows that teachers split the ideological divide, while farmers and athletes are more likely to be conservative.

House member ideology by occupation
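To give a feel for what a query behind this kind of comparison might look like, here is a minimal sketch. It is not the query the team ran: it only builds a SPARQL string and never contacts Wikidata. The identifiers are assumptions to check on Wikidata before use (P39 for "position held", P106 for "occupation", Q13218630 for a seat in the U.S. House, and the two example occupation IDs), and `build_occupation_query` is a hypothetical helper name.

```python
# Sketch only: builds a SPARQL query string; it does not contact Wikidata.
# All identifiers below are assumptions to verify on wikidata.org.
HOUSE_MEMBER = "Q13218630"  # assumed item for "United States representative"


def build_occupation_query(occupation_qids, position_qid=HOUSE_MEMBER):
    """Return SPARQL listing people who held a position and have a listed occupation."""
    if not occupation_qids:
        raise ValueError("at least one occupation QID is required")
    values = " ".join(f"wd:{qid}" for qid in occupation_qids)
    return f"""
SELECT DISTINCT ?member ?memberLabel ?occupationLabel WHERE {{
  VALUES ?occupation {{ {values} }}
  ?member p:P39/ps:P39 wd:{position_qid} ;   # position held (assumed P39)
          wdt:P106 ?occupation .             # occupation (assumed P106)
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
""".strip()


# Placeholder occupation IDs (assumed farmer and teacher); check each before running.
query = build_occupation_query(["Q131512", "Q37226"])
print(query)
```

The resulting string could be pasted into the Wikidata Query Service. The ideology scores themselves would have to come from a separate dataset and be joined to the results.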

What has been the biggest challenge to date?

The largest challenge has been defining the scope of the project. Ideally, we would be perfect custodians of Wikidata, meaning we would create and clean up every page our project relies on.

For example, if a senator attended Vanderbilt’s School of Law, we would ensure that the item for “Vanderbilt School of Law” exists and has the information needed to run the appropriate queries. Unfortunately, we must balance the time spent entering information into members’ pages against the time spent cleaning up these neighbour pages.

In the next round, we will be able to spend more time adding Wikidata pages for the various political positions, occupations, and other information we encounter in members’ bios.

What has been the biggest achievement or outcome?

We are happy with many of the tangible outcomes so far such as the number of pages coded and fields updated. Our largest achievement is coming up with a system that allows undergraduates to participate in the project. Academic work that combines faculty, graduate students, and undergraduates is substantive research that represents higher education at its best.

What do you feel you have learnt through the project?

We have learned about the technical aspects of the project, like what to expect with Wikidata, how to run SPARQL queries, and how to manage workflow. We have learned about the human aspect too.

Updating Wikidata pages can sometimes be a monotonous experience. Weekly catch-up meetings that covered the most interesting information found in our week of coding made the day-to-day work much more interesting and rewarding. We found out some fascinating factoids along the way, including how many members of Congress died in plane crashes and unusual previous occupations like “propagating oysters”.

How did you structure the learning process? And was using Wikidata challenging for students?

Students came in with little Wikidata or Wikipedia editing experience. They picked up the task very quickly through a structured process, starting with simple data entry tasks and progressing to more complex and subjective ones.

After a few weeks of training, every member of the team was able to work on their own, largely unassisted, and the pace of data collection increased rapidly.

What is the background of students? Were they technically-minded?

Our team came in with an array of technical skills. The majority of the team were political science students with limited quantitative data experience.

One member had some experience in computer science. Many members were focusing on interesting questions like gender equality and foreign affairs in their own work, which kept them intrinsically motivated through the project.

Did you use tools to input data, or was data manually entered?

So far, we have manually entered all data through the Wikidata interface. Our future plans include bulk inputs, using customized scripts to bring existing political science data into Wikidata.
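As an illustration of what such a script could look like, the sketch below turns rows of existing data into batch commands. It assumes the tab-separated item/property/value command format used by the QuickStatements tool, which the team does not say they plan to use. The `to_batch_commands` helper and the sample rows are hypothetical, and the property IDs (P106 "occupation", P69 "educated at") are assumptions to verify.

```python
# Sketch only: formats rows as tab-separated "item, property, value" lines.
# Every identifier in the sample rows is a placeholder, not project data.
import re

ITEM_ID = re.compile(r"Q[1-9][0-9]*")
PROPERTY_ID = re.compile(r"P[1-9][0-9]*")


def to_batch_commands(rows):
    """Turn (item, property, value_item) triples into one command per line."""
    lines = []
    for item, prop, value in rows:
        # Reject anything that does not look like a Wikidata identifier.
        if not (ITEM_ID.fullmatch(item) and PROPERTY_ID.fullmatch(prop)
                and ITEM_ID.fullmatch(value)):
            raise ValueError(f"malformed identifiers in row: {(item, prop, value)!r}")
        lines.append("\t".join((item, prop, value)))
    return "\n".join(lines)


rows = [
    ("Q1", "P106", "Q2"),  # placeholder member, assumed "occupation", placeholder job
    ("Q1", "P69", "Q3"),   # placeholder member, assumed "educated at", placeholder school
]
commands = to_batch_commands(rows)
print(commands)
```

Validating identifiers before generating commands is a deliberate choice here: a bulk upload multiplies any mistake in the source data, so failing early is cheaper than cleaning up afterwards.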

What are the next steps for the project? What are you developing? Or what would you like to develop?

We are developing several aspects of the project. One: we are working on a paper that will highlight the advantages of using Wikidata for quantitative social science researchers.

Two: we are working on future grants and data that demonstrate the effectiveness of semantic web-based data collection and sharing for other social science data.

Three: we are developing ways to test the quality of the data we entered against independently-collected data.
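One simple form such a quality check could take is a field-by-field comparison between the entered records and an independent source. The sketch below is only an illustration under that assumption: the record layout, member IDs, field names, and values are all made up, and `compare_records` is a hypothetical helper.

```python
# Sketch only: compares hand-entered records with an independent reference.
# All IDs, fields, and values here are invented for illustration.
def compare_records(entered, reference, fields):
    """Return (member_id, field, entered_value, reference_value) for each mismatch."""
    mismatches = []
    # Only members present in both sources can be compared.
    for member_id in sorted(entered.keys() & reference.keys()):
        for field in fields:
            ours = entered[member_id].get(field)
            theirs = reference[member_id].get(field)
            if ours != theirs:
                mismatches.append((member_id, field, ours, theirs))
    return mismatches


entered = {
    "M001": {"birth_year": 1950, "occupation": "teacher"},
    "M002": {"birth_year": 1961, "occupation": "farmer"},
}
reference = {
    "M001": {"birth_year": 1950, "occupation": "teacher"},
    "M002": {"birth_year": 1962, "occupation": "farmer"},
    "M003": {"birth_year": 1948, "occupation": "athlete"},
}
fields = ["birth_year", "occupation"]

mismatches = compare_records(entered, reference, fields)
shared = entered.keys() & reference.keys()
agreement = 1 - len(mismatches) / (len(shared) * len(fields))
print(mismatches, agreement)
```

A mismatch does not say which source is wrong, so each one would still need a manual check against the original biography.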

Have you connected to other members of the Wikidata community?

To date, our main involvement has been with mySociety, but we would love to reach out to more projects to learn about their processes and best practices.

Ideally, others interested in the United States Congress would join us and help us update pages, run queries, and mould the project process.

Is there anything you would love to learn more about and need support on?

There are many aspects of Wikidata that we would like to learn about.

Constructing SPARQL queries is a constant learning process. Typically, the more interesting our question, the more difficult it is to write the SPARQL query that answers it.

Fortunately, the SPARQL community advocates sharing queries so others can improve their own skills through example.

We would also like to learn about how other projects manage the many layers of pages and updating. How far should we go in improving the pages we encounter and what are the best practices for building a large-scale project?

How can people keep updated on the project?

Currently, we are working on a paper that we hope to submit to a political science journal, covering our process and the benefits of using Wikidata. We hope to have other research updates to share with academics and the broader Wikidata community in the future.

Anything else we should know?

That covered a lot! Thank you for allowing us the opportunity to share our work with a broader audience.