Fighting Ignorance with Data

Democratizing data is how we open source progress

Published in TIL with BigQuery, Jan 19, 2017

No matter if it’s art, science, or philosophy — humanity’s progress has always been aided when data is allowed to flow freely.

Playing with large datasets helps us add context to our observations; by letting us take a step back from what we can see ourselves, we can make out the bigger picture.

Democratizing data, by making rich datasets available for practically anyone to access and use, gives more people an opportunity to find useful information within the petabytes of data, leading to improved city services, scientific discoveries, and new business opportunities.

Critically, it also means practically anyone can verify and replicate findings — a cornerstone of the scientific method.

Scientists, cities, and government agencies already use this data extensively, and folks like FiveThirtyEight are doing an amazing job of presenting their observations using interesting visualizations. And we get to play with big data too.

One of the best things about my job at Google is indulging in new passions — provided I can come up with a justification

I’m not a data scientist, but I’ve always been a data nerd. I love spreadsheets and graphs, analyzing data, looking for patterns, understanding complicated questions, and visualizing complex things in simple ways.

So when I started poking around the public datasets in BigQuery, it was an instant love-match

You start by seeing how unique your name is. Then, when your wife’s name peaked in popularity. Soon you’re doing party tricks guessing a person’s birth state and age and calculating the effect of presidential elections on baby names. And that’s before you move on to the real big data.

I want to keep poking around, and also keep getting paid, so I’ll be documenting my BigQuery journey, and including contributions from colleagues and industry experts, in a series of blog posts and videos we’re calling “Today I Learned with BigQuery”. The first post, on the launch of our New York City public dataset, is now available.

The difference between skepticism and willful ignorance is a willingness and ability to evaluate facts to draw logical conclusions

The proliferation of fake news, poor reporting, and a “post truth” political dialogue has made it more important than ever that we democratize data, and provide the tools and services needed to look at the data behind the claims.

There’s never been a time when more raw data has been made publicly available. Cities like New York have put much of their data freely online, so you can investigate claims of rising crime and deteriorating infrastructure. NOAA weather data can help you understand how weather changes can affect our lives. GitHub data lets you confirm if tabs or spaces are more popular, and the birth name database can tell you if Dylin is a boy’s or girl’s name.

BigQuery public datasets aren’t the only place you can access Big Data repositories

Many government agencies, cities, and companies host their data and make it freely available. BigQuery competitors like Amazon also share data collections.

I like BigQuery because it’s fully managed, so I don’t have to set up and configure anything. It lets me easily share tables and queries, it’s super fast, and for the datasets I’m working with it’s incredibly cheap: free, in fact, for the first terabyte each month.

There’s also a big advantage in having lots of different datasets in the same place, making it easier to use them together. We can use the NOAA GSOD weather data to understand the effect of temperature on calls to 311, how rain impacts car accidents, or how weather alters Citi Bike and taxi ride patterns. Or we can compare cities to find patterns of similarity and difference, and what those can teach us about city planning and improvement.
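To make "using them together" a little more concrete, here's a minimal Python sketch of the idea: line up two daily series by date, then check how strongly they move together. Everything in it is an assumption for illustration. The helper names (`join_by_date`, `pearson`) are hypothetical, and the numbers are made-up placeholders, not real NOAA or 311 figures. In practice the join would more likely happen in a query over the public tables themselves.

```python
from statistics import mean


def join_by_date(left, right):
    """Inner-join two {date: value} mappings on the dates they share."""
    shared = sorted(left.keys() & right.keys())
    return [(day, left[day], right[day]) for day in shared]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (var_x * var_y) ** 0.5


# Placeholder values only: NOT real weather or 311 data.
daily_mean_temp_f = {
    "2016-01-01": 30.0,
    "2016-01-02": 40.0,
    "2016-01-03": 50.0,
    "2016-01-04": 60.0,
}
daily_311_calls = {
    "2016-01-02": 900,
    "2016-01-03": 800,
    "2016-01-04": 700,
    "2016-01-05": 650,
}

# Only dates present in both series survive the join.
rows = join_by_date(daily_mean_temp_f, daily_311_calls)
r = pearson([temp for _, temp, _ in rows], [calls for _, _, calls in rows])
print(f"{len(rows)} shared days, correlation r = {r:.2f}")
```

A correlation like this is only a starting point for a question, not an answer: it says nothing about cause, and a real analysis would need far more than a handful of days.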

Also, I work at Google — so if I want new features or new datasets there’s a good chance I can make it happen

We’ll be posting weekly blog and social media posts featuring video and visualizations.

That includes launching new public datasets, demonstrating BigQuery and related technologies, sharing protips, and offering interviews with industry experts and real data scientists.

We’d love to get the community involved, so if you have questions about using BigQuery, suggestions for things to investigate, or new datasets we should try to get added — send them our way and we’ll endeavor to answer them as part of the series.

We’d also love to see what you all find, so if you learn something cool using BigQuery, share it using #TILwBQ.