D4D at the Three-Month Mark

Or: Hackathons, Competitions, and Press Coverage, Oh My

Lilian H
Data for Democracy
Mar 14, 2017

So much has been going on since last month’s check-in that it’s hard to know how to put it into words. So let’s start with some numbers!

Data for Democracy has reached:

1038 members

More than 1430 GitHub commits, across 37 GitHub repos

Over 75000 Slack messages

That’s right — we’ve passed the 1000-volunteer mark and are still going! You can be one of the next thousand, just by emailing team@datafordemocracy.org. Read on for some exciting things and upcoming events you can be a part of, if you join the D4D community.

Project Updates

Here’s what some of our active projects have been up to in the past month, and where and how you can jump in.

Assemble

The Assemble team, which builds tools for researchers studying online communities, has been focusing on developing and deploying approaches to community detection. Within the next few weeks, they hope to put together a cohesive approach that researchers can easily pick up and use.

The team chose to set their sights on this because they view community detection as a vital component of several broader lines of research:

  • Information diffusion and social contagion: Namely, how information moves into and across a community. The past decade or so has seen a veritable explosion of research regarding how information flow happens in online social networks, and the team considers it important to understand how this happens in “far-right” and related communities.
  • Bot detection: An important aspect of studying online social networks is the detection of digital personas. Why are they there? What are their characteristics? There are a great many avenues of research available for pursuit in this area.
  • Language usage across communities: There are opportunities to build on recent research to predict changes in community composition based on their language use — for example, by examining the idioms, euphemisms, and memes that the community uses to communicate, and the graph properties of how this language is structured.

A team member, Henri, has made major headway in developing and releasing the preliminary version of a community detection algorithm. His work has been supported throughout by teammates who are collecting and curating data across dozens of online social network platforms. It’s still in its early stages, but initial findings are promising, and the team has high hopes that this will evolve into a one-stop solution for community detection and bot detection.
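
For readers new to the topic, here is a small sketch of one common way to judge whether a proposed split into communities is any good: Newman's modularity score, which rewards groups that have more internal links than chance would suggest. This is not Henri's algorithm (the post doesn't describe it), and the tiny graph and the names `modularity`, `edges`, `split`, and `lumped` are made up for illustration.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Newman modularity Q for a partition of an undirected, unweighted graph.

    edges: list of (node, node) pairs, each edge listed once.
    community_of: dict mapping every node to a community label.
    """
    m = len(edges)
    if m == 0:
        raise ValueError("graph has no edges")

    internal_edges = defaultdict(int)  # edges with both ends in the community
    degree_sum = defaultdict(int)      # total degree of the community's nodes
    for a, b in edges:
        ca, cb = community_of[a], community_of[b]
        degree_sum[ca] += 1
        degree_sum[cb] += 1
        if ca == cb:
            internal_edges[ca] += 1

    # Q = sum over communities of (share of edges inside) - (expected share).
    return sum(
        internal_edges[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )


# Hypothetical toy graph: two triangles joined by a single bridge edge (c-d).
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
lumped = {node: 0 for node in split}

print(modularity(edges, split) > modularity(edges, lumped))
```

Splitting the toy graph along the bridge scores higher than lumping everyone together, which matches intuition. Many community detection methods work by searching for partitions that push a score like this one up.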

How You Can Get Involved:

There’s still a lot more work to be done with curating source datasets and developing algorithms, and the team would love more fresh perspectives.

If you’re interested in novel approaches to community detection in online social networks, consider jumping on board at the #assemble Slack channel! This is a golden opportunity to study some emerging communities and how social contagion works across the “far-right”.

Drug Spending

The Drug Spending team is currently neck-deep in data acquisition, consolidation, and wrangling. Information on drugs and drug spending comes in a whole range of disparate datasets from multiple organizations, with little universal information to cross-match between them. In order to dive further into the available datasets, the team first needs to make sure they exist side-by-side in a cleaned and easily accessible form.

In line with this, the team has mainly been focused on acquiring and cleaning data sources. They are working hard at data consolidation — building indices and references between datasets, such that they can easily be associated with each other.
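
As a rough illustration of what "building indices and references between datasets" can look like, the sketch below links two record sets on a normalized drug name. It is only a guess at the general shape of the task, not the team's actual code: the field names, the sample rows, and the helpers `normalize_name`, `build_index`, and `link` are all hypothetical.

```python
import re
from collections import defaultdict


def normalize_name(name):
    """Crude matching key: lowercase, with punctuation and extra spaces collapsed."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


def build_index(records, name_field):
    """Map each normalized name to the rows that carry it."""
    index = defaultdict(list)
    for row in records:
        index[normalize_name(row[name_field])].append(row)
    return index


def link(left, right, left_field, right_field):
    """Pair rows from `left` with rows from `right` that share a normalized name."""
    right_index = build_index(right, right_field)
    matched, unmatched = [], []
    for row in left:
        hits = right_index.get(normalize_name(row[left_field]), [])
        if hits:
            matched.extend((row, hit) for hit in hits)
        else:
            unmatched.append(row)
    return matched, unmatched


# Made-up rows, purely for illustration.
dataset_a = [
    {"drug_name": "ATORVASTATIN  CALCIUM", "total_spending": 100},
    {"drug_name": "Examplinol", "total_spending": 50},
]
dataset_b = [{"brand": "Atorvastatin-Calcium", "manufacturer": "Example Pharma"}]

matched, unmatched = link(dataset_a, dataset_b, "drug_name", "brand")
print(len(matched), [row["drug_name"] for row in unmatched])
```

Keeping the unmatched rows around matters as much as the matches: they show where exact-key matching falls short and where fuzzier matching or domain knowledge is needed.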

They have also ventured into making initial visualizations and exploratory analyses of these datasets. One volunteer, Stephanie, has been building an R Shiny app to visualize manufacturers and the drugs they produce, based on Medicaid and Medicare data. Another volunteer, David, is using a related dataset to visualize year-over-year increases in drug spending.

How You Can Get Involved:

The team is currently grappling with drug classification, and trying to develop a framework to associate drug names with their general therapeutic uses at a large scale. The project could greatly benefit from the input of contributors who have domain knowledge and experience regarding pharmaceuticals and health policy. If you fit the bill, your help would be very welcome! Check out the datasets and the #drug-spending Slack channel.

Election Transparency

The Election Transparency team has been busy over the last few weeks, working on collecting, normalizing, and standardizing historical data on county-level electoral outcomes. This effort continues to pose new challenges, as the format of the data varies widely — some states provide Excel spreadsheets, while others only make data available in PDFs. The team continues to work through these hurdles and find creative ways to build out their (publicly available!) dataset.

They have also started to build models and create visualizations, leveraging their newly cleaned data to help explain the outcome of the 2016 U.S. presidential election. One contributor, Robert, drew on the dataset to build a Shiny app that displays county-level Presidential election results back to the year 2000.

Moving forward, the team is starting to collect data and shapefiles to explore the impact of redistricting and gerrymandering on electoral outcomes. Through partnering with the OpenElections Project to build precinct-level data for all statewide races, the team hopes to address questions including:

  • What are appropriate measures of a fairly-drawn district?
  • How much do current district boundaries deviate from a fairly-drawn district?
  • How well do current districts represent the demographics of the country?
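
To make the first of these questions a little more concrete, here is a sketch of one measure from the published redistricting literature, the efficiency gap, which compares how many votes each party "wastes" across a set of districts. The team hasn't said which measures they plan to use, so treat this as one candidate only. The function name, the two-party simplification, the rule that votes beyond half the district total are wasted, and the sample numbers are all assumptions made for the example.

```python
def efficiency_gap(districts):
    """Efficiency gap for a two-party map.

    districts: list of (votes_a, votes_b) tuples, one per district.
    Returns (wasted_a - wasted_b) / total_votes. A positive value means
    party A wasted more votes, i.e. the map works in party B's favour.
    """
    wasted_a = wasted_b = total = 0.0
    for votes_a, votes_b in districts:
        district_total = votes_a + votes_b
        needed = district_total / 2  # simplified winning threshold
        if votes_a > votes_b:
            wasted_a += votes_a - needed  # winner's surplus votes
            wasted_b += votes_b           # all of the loser's votes
        elif votes_b > votes_a:
            wasted_b += votes_b - needed
            wasted_a += votes_a
        else:
            # Simplification: this sketch does not try to handle exact ties.
            raise ValueError("tied district is not handled in this sketch")
        total += district_total
    if total == 0:
        raise ValueError("no votes recorded")
    return (wasted_a - wasted_b) / total


# Made-up results for three districts, as (party A votes, party B votes).
example_map = [(70, 30), (40, 60), (45, 55)]
print(efficiency_gap(example_map))
```

In this made-up map, party A wins more votes overall but only one of the three seats, and the positive gap reflects that. A single number like this is a starting point for discussion rather than a verdict on whether a map is fair.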

Stay tuned for more detailed posts about the Election Transparency project in the coming weeks! These will delve into the nitty-gritty of the data collection process, and the key findings that have been made so far.

How You Can Get Involved:

The team is always interested in additional projects that will further their goal: to make the electoral system more transparent and easier to understand for everyone. If this is of interest to you, check out the project details and join the #election-transparency Slack channel, and let them know if you have an idea!

Internal Displacement

Based on a challenge set by the Internal Displacement Monitoring Centre (IDMC), the D4D Internal Displacement team is building a tool to populate a database with information about displacement events, which can then be used by both machine and human analysts.

So far, most of the team’s efforts have been focused on building a Python back-end to scrape, classify, and extract information from articles in IDMC-provided datasets. Information retrieval is proving to be an interesting challenge, in part due to the complexities involved with natural language processing. The team has recently succeeded in finalizing the database schema for the project, and is now shifting gears to building the front-end app.

How You Can Get Involved:

The team is looking for volunteers who are interested in helping to build the front-end app to visualize and interact with the database.

In addition, the team will continue their work to further refine the back-end code, work on the tricky issue of natural language processing, and implement online machine learning for new documents. It’s a good time to join the #internal-displacement Slack channel if you’re interested in any of these areas!

ProPublica

So far, the collaboration between ProPublica and D4D has centered on two main threads: data relating to official foreign travel of elected representatives, and data on House Expenditure reports. The team’s work has largely focused on loading and cleaning these datasets to standardize them, remove duplicates, and wrangle the data into a convenient format for analysis.

The foreign travel dataset is based on the House Official Foreign Travel reports, published quarterly by the House Clerk. ProPublica hopes to eventually use this dataset to track how official foreign travel expenditure has changed over time, and in particular whether and how it is influenced by political and international events.

The House Expenditure project was one of several undertakings adopted by ProPublica in 2016, after the closure of Sunlight Labs — an open source community run by the Sunlight Foundation, which sought to use data to increase government transparency and accountability. Through working with D4D on this dataset, ProPublica aims to detect unusual variances in spending by lawmakers. ProPublica also hopes to eventually add a search interface and make the data available for download.
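
For a sense of what "detecting unusual variances" might involve, here is a minimal sketch that flags values sitting far from the median, using a modified z-score based on the median absolute deviation. This is a generic outlier check, not ProPublica's method (the post doesn't describe one), and the function name, the conventional 3.5 cutoff, and the sample amounts are assumptions for illustration.

```python
from statistics import median


def flag_unusual(amounts, threshold=3.5):
    """Return the positions of values whose modified z-score exceeds `threshold`.

    The modified z-score uses the median and the median absolute deviation
    (MAD), so a single extreme value does not distort the baseline.
    """
    center = median(amounts)
    mad = median([abs(x - center) for x in amounts])
    if mad == 0:
        # Most values are identical, so this score cannot rank anything.
        return []
    return [
        i for i, x in enumerate(amounts)
        if abs(0.6745 * (x - center) / mad) > threshold
    ]


# Made-up spending totals for one office, one number per reporting period.
spending = [1200, 1150, 1300, 1250, 1180, 5400]
print(flag_unusual(spending))
```

A flagged value is only a prompt to look closer. Plenty of legitimate expenses are lumpy, so results like this need a human reading the underlying line items.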

How You Can Get Involved:

If you have any interest in investigating lawmaker activity and ensuring accountability, ProPublica is looking for more volunteers. Feel free to browse the data and join the #propublica Slack channel if it piques your interest!

USA Dashboard

The USA Dashboard team, which is creating a dashboard that will display key metrics for various regions of the USA, has been moving data into PostgreSQL; the D4D community can now access the data through Mode, and run reports and preliminary analyses there. The team is currently working on defining data documentation and writing data dictionaries, in order to facilitate more targeted analysis.

How You Can Get Involved:

The team is seeking domain experts who can provide input on how to count crime reports to make a fair comparison across cities.
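
As a starting point for that conversation, the usual baseline is to convert raw counts into a rate per 100,000 residents, so that large and small cities can sit on the same chart. The sketch below shows only that baseline; it says nothing about differences in how cities define or record reports, which is exactly where domain experts come in. The function name and the city figures are made up.

```python
def rate_per_100k(report_count, population):
    """Convert a raw count into a rate per 100,000 residents."""
    if population <= 0:
        raise ValueError("population must be positive")
    return report_count * 100_000 / population


# Made-up cities: (number of reports, population).
cities = {"City A": (1200, 300_000), "City B": (2000, 1_000_000)}
rates = {name: rate_per_100k(count, pop) for name, (count, pop) in cities.items()}
print(rates)
```

Here City B has more reports in absolute terms but the lower rate once population is taken into account.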

As the project broadens its focus to explore new metrics, including economy, poverty, and healthcare, the team would also like to recruit domain experts in these areas, who can help to develop and frame research questions based on the available data.

If you’d like to be part of this effort, drop the team a line at the #usa-dashboard Slack channel.

D4D in the News + KDNuggets competition

Besides all the activity going on internally, D4D has also received some great press coverage elsewhere!

KDNuggets reposted last month’s blog update, and we received a shout-out in a TechCrunch feature about Data.World, one of D4D’s most enthusiastic partners and supporters.

We’ve also sponsored a “Data Science vs Fake News” contest, in collaboration with KDNuggets and Data.World. The deadline for submissions was on March 10, but if you missed it, no worries — there’s another upcoming opportunity to show off your data science chops!

D4D Hackathon

Banner by volunteer Justin

We’re excited to announce the very first Data For Democracy Global Hackathon! More information about this event will be given in an upcoming post, but here are the key facts:

FROM: March 31st, 2017, 6 pm EDT
TO: April 2nd, 2017, 2 pm EDT

Anyone can participate, simply by signing up as a D4D member (again, shoot an email to team@datafordemocracy.org and we’ll get you sorted out)!

Since this is a global group, many people will be working remotely, using tools such as Google Hangouts, Slack, and GitHub. In-person meetups are also being organized in some major cities; again, more details will be coming nearer the date. Feel free to arrange a meetup of your own!

The Hackathon will conclude with a showcase on April 2nd, 2 pm — 3 pm EDT, where the D4D community will display and demonstrate what they built during the Hackathon, in a series of short presentations.

If you have a cool project idea you’d like to get off the ground, or want to get involved with D4D but don’t know where to start, this weekend will be an excellent opportunity to dive straight in. Don’t worry about not knowing anyone — by the end of the Hackathon, you will.

Keep an eye out for more news in the coming weeks!
