Methods For Book
This post is for people curious about the quantitative methodology for the blog post series on Silicon Valley politics.
If you have additional questions, I encourage you to respond to this post by adding your own comment, rather than emailing me directly. If you have a comment or question, I suspect others do too.
Note On Methods
I believe in transparency. Everything I do should be available for open and convenient criticism. I will do my best to post all of the data in a publicly available format, and do my best to make the analysis accessible to people who aren’t familiar with statistical software.
More importantly, I am cognizant that recent research, especially in the social sciences, has shown that data scientists can come to vastly different conclusions analyzing the very same dataset. So, in all likelihood, many people will disagree with my methods.
In this document, I will also do my best to post what I think are my own critiques of the data/conclusions, which did not always support my arguments as well as I would have liked. I’ll also be posting particularly thoughtful critiques here, too.
However, I am confident that I have identified that there is something unique about Silicon Valley politics and that they are influencing the world in important ways. The data reflects my best attempt at observing these trends in the most precise way possible. My conclusions will change as people critique my methods and add in their own original data.
I welcome the criticism and want to make it as easy as possible for people to find out the truth.
Where can I get the raw data?
The raw data for the survey of founders is available here as a Google Doc. It includes everything but IP addresses and real names.
If you’re familiar with R, here’s some code I used to analyze the survey.
For those unfamiliar with statistical software, SurveyMonkey has some good tools to slice data, but you need an account (you must email me to transfer a copy). I also encourage people to use online tools, such as DataCracker. The raw data is categorical (meaning answers are strings and not numbers). This can make analysis more difficult within Google Docs itself.
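For readers who would rather script it than use a spreadsheet, here is a minimal sketch of tallying string answers in Python. The column name and the five rows are made up for illustration; they are not from the real survey file, and this is not the R code mentioned above.

```python
import csv
import io
from collections import Counter

# Made-up rows standing in for the exported survey file; the column name
# and the answers are hypothetical, not taken from the real data.
SAMPLE_CSV = """respondent_id,change_is_good
1,Agree
2,Disagree
3,Agree
4,Other
5,Agree
"""

def tally(csv_text, column):
    """Count how often each string answer appears in one column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[column] for row in reader)

def shares(counts):
    """Convert raw counts into proportions of all answers."""
    total = sum(counts.values())
    return {answer: n / total for answer, n in counts.items()}

counts = tally(SAMPLE_CSV, "change_is_good")
for answer, share in shares(counts).items():
    print(f"{answer}: {counts[answer]} ({share:.0%})")
```

To use it on the real export, you would read the downloaded file's text in place of `SAMPLE_CSV` and pass one of its actual column headers.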
What is the sampling method for the startup founders survey?
The sampling frame comes from Crunchbase, TechCrunch’s Wikipedia-like database of startup founders, executives, and investors. I found the emails of every founder in this database and emailed 1,200 of them in alphabetical order. About half of the emails bounced or were invalid. In total, 129 founders responded. There were 8,499 names of founders in the database at the time. 129 respondents gives a margin of error of roughly +/- 7.5%.
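For anyone who wants to check the arithmetic, here is a rough sketch of the standard margin-of-error formula for a proportion, with a finite population correction. It assumes a simple random sample and the worst-case 50/50 split, and it treats the confidence level as a parameter because the post does not state one. Under these assumptions the result lands between roughly 7% and 9%, depending on the confidence level chosen.

```python
import math

def margin_of_error(n, population=None, z=1.96, p=0.5):
    """Margin of error for a proportion from a simple random sample.

    p=0.5 is the worst case; z=1.96 corresponds to 95% confidence.
    If a population size is given, apply the finite population correction.
    """
    moe = z * math.sqrt(p * (1 - p) / n)
    if population is not None:
        moe *= math.sqrt((population - n) / (population - 1))
    return moe

# Figures from the post: 129 respondents out of 8,499 founders.
# The two confidence levels are assumptions; the post does not name one.
for label, z in [("95%", 1.96), ("90%", 1.645)]:
    moe = margin_of_error(129, population=8499, z=z)
    print(f"{label} confidence: +/- {moe:.1%}")
```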
Samples of so-called “elite” or “expert” populations are typically much smaller than those of public opinion polls and have a larger margin of error as a result. Internet startup founders represent a much smaller population than is typical even of these specialized samples.
Recognizing the limits of expert populations, it is often better to interpret their results in comparison to other populations, rather than make conclusions on the accuracy of the results themselves. Where founders differ significantly from other groups or the general population, I will argue that it is statistically reasonable to suspect they hold distinct beliefs.
With this in mind, users of the data should be cautious of the limitations and report margins of error responsibly, including margins of error with subsets (even though this is not traditionally done with public opinion polls).
Note: the total number of respondents on the survey is subject to change. I’m continually adding respondents and some respondents are filling out the survey late.
Why did you use SurveyMonkey and Google Surveys?
Online polls are now a well-established tool for collecting representative samples. I’ve duplicated many polls, including those from Pew, using Google Surveys.
Comedians like to poke fun at polls conducted with SurveyMonkey. Anything with the word “monkey” probably doesn’t inspire confidence in scientific precision, but it’s just the name of a startup. The research backing what goes into online sampling frames is very thorough; skeptical readers are welcome to review the established evidence on the topic.
Many of your questions are binary. Why?
The academic term for this is “forced binary” response options and there is an established literature on why they are often superior to typical multiple choice options.
Surveying the public, especially CEOs, on sensitive political topics lends itself to agreeableness and social pressure. Respondents tend to agree with every issue that seems like it would be acceptable in public. This is why on some surveys, such as the Pew American Values Survey, we often see questions that have near 90% agreement. Since there are very few controversial issues, by definition, that have 90% agreement, we need to ask questions in a way that prevents bias by agreeableness.
Second, I wanted to see how survey respondents would do if they were in power. Asking someone to “agree” or “disagree” with a statement is much different than asking them to choose between a few difficult options, which is often the case in real life.
Surveys necessarily simplify people’s beliefs (it’s the only way to quantitatively compare hundreds of people). But, I believe that forced binary response options result in a more honest and nuanced representation of someone’s beliefs on topics they feel uncomfortable taking a stand on.
That said, I will be exploring other survey response options in the future and anyone is welcome to replicate my surveys with alternative wording. I very much welcome interpretations, replication and modification of the questions.
What software did you use?
You have a lot of “other” responses. Does this concern you?
Yes and no. Online surveys are known to have a much higher rate of respondents answering ‘don’t know’ or ‘other’. People are less likely to offer an opinion when they don’t feel the personal pressure of an interviewer asking them to make a choice.
In my own interpretation of the data, I’m more comfortable comparing respondent answers to some other group, like another political party or the public at large. That is, I know the magnitude of the difference between groups, but less so the precise mean of their answers.
The question on equality has a very small sample size. Why?
The survey is constantly being updated. Some questions are added late. One question on equality, in particular, has a very small sample size (15) and should be interpreted with caution. I included it because I think it’s a fantastic representation of founders’ beliefs and, though the sample size is small, it differs dramatically from the public. (The public sample is also small, from 31 Amazon Mechanical Turk workers. For more information on Mechanical Turk in survey research, you can read up here.)
As a writer, I’m a fan of releasing information to the public and letting them decide if the data is trustworthy (noting the appropriate limitations), rather than making that decision for them and withholding the data.
If you would like to recreate the graphs with margin of error bars that resize based on each question, I’ve included code in the R file that will help.
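To give a sense of how much the bars change with sample size, here is a small sketch that computes an interval per question from its own number of answers. It uses the Wilson score interval, which behaves better than the simple formula when samples are as small as 15. That is my substitution for illustration and not necessarily what the R file does, and the 60% share is a hypothetical placeholder.

```python
import math

def wilson_interval(share, n, z=1.96):
    """Wilson score interval for a proportion observed in n answers."""
    denom = 1 + z**2 / n
    center = (share + z**2 / (2 * n)) / denom
    half = z * math.sqrt(share * (1 - share) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Sample sizes from the post (15 founders, 31 Mechanical Turk workers,
# 129 founders overall); the 60% share is a hypothetical placeholder.
for n in (15, 31, 129):
    low, high = wilson_interval(0.60, n)
    print(f"n={n:>3}: {low:.0%} to {high:.0%}")
```

The interval for the 15-person question comes out far wider than the one for the full sample, which is the reason to draw the bars per question.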
Where can I get the data?
Results from survey 1 are here.
Download the raw file here.
Results from survey 2 are here.
Download the raw file here.
R code will be posted soon (you can email me for a messy version of it if you need it right away).
How was the survey conducted?
SurveyMonkey’s audience was polled on two surveys. The first poll includes most of the questions from the CrunchBase survey (above). The second poll adds questions related to 2016 candidates and college education.
The public opinion polling with Google Surveys was conducted through 15 different polls with varying zip-code targeting, wording, and response options.
The polling used in the article is entirely San Francisco-based, because I wasn’t confident that there were enough tech independent contractors working in other cities to constitute a reasonable representative sample. Between 3–6% of San Franciscans (or those taking the survey within a San Francisco zip code) identify as working as an independent contractor for a tech company, as a driver, delivery person, or some other low-skill position.
It’s worth noting that some of these sample sizes are quite small, but they are pulled from a very large sample of the population. To attain a sample of 200 self-identified gig economy workers, we had to poll ~1,600 people. Getting a traditional sample size of 500 gig workers for each poll would have required surveying tens of thousands of respondents. So, there are logistical limitations to conducting a dozen polls targeting a fraction of the work force (the same limitations apply to polling sub-divided by industry occupation).
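The cost problem is easy to see with a little arithmetic. This is a back-of-the-envelope sketch that assumes every respondent is equally likely to be a gig worker and uses the 3–6% incidence range cited above; the function name is my own.

```python
import math

def respondents_needed(target_subgroup, incidence):
    """Expected number of people to poll to reach a subgroup of a given size."""
    return math.ceil(target_subgroup / incidence)

# Incidence rates from the post: 3-6% of San Francisco respondents
# identify as tech independent contractors.
for incidence in (0.03, 0.05, 0.06):
    total = respondents_needed(500, incidence)
    print(f"incidence {incidence:.0%}: poll about {total:,} people for 500 gig workers")
```

Multiply the result by the number of separate polls to get the total number of respondents a full series would require.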
For those worried about sample size, I encourage you to combine the polls and run statistical significance tests. Please email me any interesting results (or contradictions) and perhaps I can promote them.
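As one way to do that, here is a minimal sketch of pooling several polls and running a two-proportion z-test. All counts below are hypothetical, not real results. Naive pooling also ignores the differences in wording and targeting between polls, so treat the output as a rough check rather than a final answer.

```python
import math

def pool(polls):
    """Combine (yes_count, sample_size) pairs from several polls."""
    yes = sum(y for y, _ in polls)
    n = sum(size for _, size in polls)
    return yes, n

def two_proportion_z_test(yes_a, n_a, yes_b, n_b):
    """Two-sided z-test for a difference between two proportions."""
    p_a, p_b = yes_a / n_a, yes_b / n_b
    pooled = (yes_a + yes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts, NOT real results: (yes answers, respondents) per poll.
gig_workers = [(30, 60), (45, 80), (35, 60)]
everyone_else = [(200, 500), (260, 600), (190, 500)]

z, p_value = two_proportion_z_test(*pool(gig_workers), *pool(everyone_else))
print(f"z = {z:.2f}, p = {p_value:.4f}")
```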
Here are links to the raw data for all files:
Immigration by gig economy employment
Where can I get the data?
Download the raw data for Crunchbase merged with Federal Election Commission data here
Here is a separate file for Crunchbase/FEC combined with separate columns for CrunchBase business categories.
Merging was done with the help of Dataladders.com. Non-stats folks should be aware that any large-scale merge is likely to have omissions and false positives.
Here is the file for all of the maps used in the book.
Democratic Leadership Map
The government leadership map combines data from the Martin Prosperity Institute on the proportion of creative class workers in an area with Govtrack leadership scores.
All of the files are available here
R code here.
Optimal Law Coding
The data for recreating the graphs in the optimal coding chart is here
Data on creative class proportion of each congressional district are here. I used 2010 data, 3 years behind the law I coded (though the law was probably written near the date of the census data collection).
2013 laws were coded for every Democratic member of the House Judiciary Committee and select members of the Democratic Party (such as then-Senator Barack Obama).
The basic coding procedure was to look at whether a law was “optimal” or for “disadvantaged”. Optimal laws related to enhancing citizens’ ability to solve social problems: getting STEM education, being better citizens, being healthier, protecting the environment, or being a watchdog. We looked for performance-based funding and mandatory transparency, among other traits.
“Disadvantaged” was any law that primarily targeted marginalized groups: minorities, low-income, or American businesses threatened by foreign competition.
A law can be both optimal and disadvantaged. This happened less often in the House and more often in the Senate, interestingly enough.
There are a lot of columns that I didn’t end up using in the analysis. We’re still in the middle of hand-coding nearly every member of the House and refining the coding protocol. Traditional social scientific guidelines call for multiple coders to agree on ~70% of all items. Because the coding was done by just me and one assistant, I urge caution with the interpretation, and we confined our analysis to just Judiciary members.
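For anyone who wants to check agreement between two coders, here is a minimal sketch. It computes simple percent agreement (the ~70% guideline) and Cohen's kappa, which I add as a common chance-corrected companion; kappa is not something the coding protocol above specifies. Because a law can be both optimal and disadvantaged, each trait is treated as its own yes/no flag, and the flags below are hypothetical.

```python
from collections import Counter

def percent_agreement(coder_a, coder_b):
    """Share of items on which two coders assigned the same label."""
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return matches / len(coder_a)

def cohens_kappa(coder_a, coder_b):
    """Agreement corrected for the agreement expected by chance."""
    n = len(coder_a)
    observed = percent_agreement(coder_a, coder_b)
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / n**2
    return (observed - expected) / (1 - expected)

# Hypothetical "is this law optimal?" flags for ten laws (1 = yes, 0 = no);
# these are not taken from the real coding sheet.
coder_a = [1, 1, 0, 0, 1, 0, 0, 1, 1, 0]
coder_b = [1, 0, 0, 0, 1, 0, 1, 1, 1, 0]

print(f"agreement: {percent_agreement(coder_a, coder_b):.0%}")
print(f"kappa: {cohens_kappa(coder_a, coder_b):.2f}")
```

The same two functions can be run once per trait column, for example once for the optimal flag and once for the disadvantaged flag.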
I then combined those scores with data from the Martin Prosperity Institute, which categorizes the percent of “creative class” workers by region. Congressional districts don’t line up perfectly with economic regions, so see R code for details or send me questions via email.
Section II: Thoughts on limitations
It wouldn’t be honest to publish all this data without talking about some of the limitations and potential contradictions.
Political Spectrum limitations
Political categories are tricky business. Even political party has a poor correlation with actual belief. One example I mention in the chapter on Democrats is that Libertarians, the aggressively small-government faction of the Republican Party, barely seem different from the public on their key issues:
So, creating another political category, even if it seems recognizable, is going to have some inconsistencies. There is only a loose correlation between the big 4 questions in the survey about change, government’s role in world affairs, government’s role in personal decisions, and public competition between public services.
Theoretically, according to the typology, most people who believe in competition in public services should also personally enjoy new experiences and be more optimistic about change. But, the differences are not always that strong:
I found that questions between collectivism and individualism had a much higher correlation than the order vs. progress spectrum (i.e. questions about innovation and change). I suspect this is because many respondents are used to questions about governments’ role in encouraging behavior.
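One simple way to measure how closely two yes/no questions move together is the phi coefficient, which is the Pearson correlation for two binary items. This is a sketch of that technique, not necessarily the calculation I ran, and the answers below are hypothetical.

```python
import math

def phi_coefficient(x, y):
    """Pearson correlation between two yes/no questions coded as 1/0."""
    pairs = list(zip(x, y))
    n11 = sum(1 for a, b in pairs if a == 1 and b == 1)
    n10 = sum(1 for a, b in pairs if a == 1 and b == 0)
    n01 = sum(1 for a, b in pairs if a == 0 and b == 1)
    n00 = sum(1 for a, b in pairs if a == 0 and b == 0)
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom

# Hypothetical answers from eight respondents (1 = agrees, 0 = disagrees).
favors_competition = [1, 1, 1, 0, 0, 1, 0, 1]
optimistic_on_change = [1, 0, 1, 0, 1, 1, 0, 0]

print(f"phi = {phi_coefficient(favors_competition, optimistic_on_change):.2f}")
```

Values near 0 mean the two answers are close to unrelated, while values near 1 or -1 mean they almost always move together or in opposition.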
Additionally, though Democrats often align very closely on policy issues (if someone is against free trade, they’re more likely to be supportive of labor unions), I found that the public only had a very weak correlation with different party factions.
This is sometimes why political scientists go to so-called “elite samples” of policy makers and other experts for opinion polling, rather than the public.
I think the evidence I presented in the chapters shows a general grouping within the proposed categories. But, weak correlations among groupings are the nature of American politics. Given this, it’s quite possible that someone could use my same data to come up with entirely different categories (perhaps they will).
The gig worker survey was difficult to get a good sample size on. Since only about 5% of workers identify as holding a tech independent contractor job, you need to poll 20x as many respondents. That adds up in costs quickly and there are only so many people in San Francisco who can be polled.
If you look at the data on the gig workers, they resemble tech workers on the question of decreasing immigration, but not on increasing it. Tech workers are more likely to say they support an increase. How this difference translates into support for current H-1B reform, I don’t know. So, while they seem less hostile to high-skilled immigration, they may not be as enthusiastic about more expansive policies.
On the other hand, the margin among gig workers for Hillary Clinton was quite high.
Chapter 3 — San Francisco Housing
Where can I get the raw files?
Here is a Google folder with the files: link. It includes raw images, data used to construct graphs, and blueprints for creating original versions of the San Francisco city landscape.
What is the economic model based on?
Moody’s economist Mark Zandi’s forecasting model [PDF] estimates that a 1% increase in housing stock above population growth decreases prices by 10% for the entire Bay Area.
Zandi was careful to explain to me that his model applies to the entire Bay Area and not necessarily to San Francisco specifically. It’s possible that different regions, cities, or neighborhoods could have a different price elasticity of housing supply.
It is notable that his model can roughly back-predict rising housing costs over the last decade in San Francisco, as the city has added more than 5,000 more people per year than available units. The graph below shows that housing prices have closely followed population growth, when you take into account how much growth has exceeded housing stock. To perform this back-of-the-napkin calculation, I treated a 1% increase in population growth above unit construction as the same as a 1% destruction in housing stock. That is, as far as supply and demand are concerned, there’s little difference between adding more people without adding homes and destroying homes for an existing population.
The fit of the model also depends on population growth, and the city counts population differently than the US Census does. So, I emphasize that the model only roughly back-predicts changes in housing prices.
San Francisco has roughly 370,000 total units, grows about 12,000 people per year, and has housing prices affordable to residents making somewhere north of $150,000. To decrease that number by half, the city could need about 20,000 more units above population growth.
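The arithmetic behind that estimate can be sketched in a few lines. This assumes the 1%-of-stock to 10%-of-price relationship scales linearly all the way to a 50% price drop, which is a simplification for illustration and not a claim about Zandi's actual model. With the figures above it gives about 18,500 units, in line with the "about 20,000" estimate.

```python
def units_needed(total_units, target_price_drop, price_drop_per_pct_supply=0.10):
    """Extra units above population growth needed for a target price drop.

    Assumes the 1%-supply-to-10%-price relationship scales linearly,
    which is a simplification for illustration.
    """
    pct_supply_increase = target_price_drop / price_drop_per_pct_supply
    return total_units * pct_supply_increase / 100

# Figures from the post: roughly 370,000 units and a target of halving prices.
extra = units_needed(370_000, target_price_drop=0.50)
print(f"about {extra:,.0f} extra units above population growth")
```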
What is the simulation based on?
We made very conservative estimates. Zandi told me that we needed around 130,000 more units over the entire Bay Area to keep housing affordable. I placed all of these in San Francisco and assumed that cheaper housing could cause an increase in population growth. So, we simulated San Francisco with 200,000 more units.
What are the limits of the simulation?
There are serious limitations to the idea that supply alone can reduce housing costs in San Francisco in the short term.
Tim Cornwell of the development analysis company, the Concord Group, tells me that the city has a deficit of labor. That is, there isn’t enough labor at the moment to build 100,000 units. Increasing capacity could likewise increase the cost of each unit.
Per-unit costs are also a concern, since it’s unclear whether it will ever be cheap enough to build a unit at a price point affordable to median-income residents. The cost of building a single apartment in San Francisco is around $430K (and up), when you factor in land costs and construction. This cost could go up if labor costs increase, or could go down if the city builds small “micro” apartments.
As of yet, there has not been a serious econometric analysis of the costs of units as supply is radically increased. The city would also need to factor in high-income immigration costs. On the other hand, construction in other cities could alleviate costs in San Francisco, wherein the model could overestimate the number of units needed.
This is to say this is a rough simulation and I hope it sparks a more thorough analysis.