The Leon Project: Big Data for Social Science
I’ve decided to go public with the key elements of a project I’ve been working on. Consider the following an informal project summary.
The Leon Project (named after Leon Festinger) will, among other things, integrate massive social science data and metadata into a huge online public repository. It will unlock much more powerful literature search, and deliver a real-time meta-analysis engine. Leon will enable greater insight into minority populations than status quo research methods, and will likely facilitate countless substantive discoveries in traditional social science domains. My focus here is social psychology, but Leon will likely extend to scientific psychology more broadly, and to other social science disciplines.
The status quo approach to social science rests on large numbers of discrete studies published in the form of peer-reviewed journal articles. These articles are written in long prose form, often in a narrative, story-telling or story-selling structure, and persist as standalone prose objects that are cited in subsequent prose objects.
The data behind these studies has generally been unavailable by default — it would have to be requested in every case, and requests are not always successful. The field is advancing in this respect, and data will increasingly be posted by journals. However, this still leaves the data for separate studies isolated from each other.
At present, we can only search the literature based on trivialities like authors, journals, titles, and a few keywords. This is barbaric. We lack the substantive semantics we need — we cannot search by hypotheses, methods, results, analyses, nature of samples, location, variables, predictors, DVs, etc.
Most research is based on student samples from one American university. This has a number of implications, one of which is that most of the participants are white or Asian. There are very few minorities in our samples (that is, of course, the very nature of being a minority), but in our research minorities are underrepresented even relative to their population base rates. As a result, it is unclear whether the effects commonly reported by researchers apply to various minority groups, or only to white Americans (or only to the white American college students who tend to enroll in Intro to Psychology courses; see Henrich, Heine, and Norenzayan, 2010).
What Leon Does
Leon will integrate research data and metadata in one online repository, and optionally the articles as well (contingent on copyright issues). Integrating the data is somewhat trivial — we can simply ingest the data that is already being posted for contemporary studies. The metadata Leon will offer is completely new.
Published articles will be coded on multiple dimensions. A new XML schema designed for the purpose will allow us to classify hypotheses, methods, nature of samples, variables, etc. The schema is currently under development, and I’ve purchased some .science domains for websites that will host it. (The .science TLD was introduced by the internet governing authorities last year — it’s a Top Level Domain, just like .com or .edu. For example, your personal website could be SpeedyGonzalez.science if you wanted.)
Example: Consider a hypothesis that anticipated emotional exhaustion motivates dehumanization (as in Cameron, Harris, and Payne, 2015). The new XML schema will identify this as a hypothesis qua hypothesis, and will extract the predictor and the DV, like so:
<predictor>anticipated emotional exhaustion</predictor>
<dv>dehumanization</dv>
(Assume that semantically equivalent hypotheses are covered too.)
Right now, we can’t search by predictor or by DV (and the predictor here is not a keyword on the paper — the utility of our keyword scheme is quite limited.) Nor can we search by all predictors that are anticipated states or burdens, which would be incredibly useful — we don’t have the semantics, the tools, or the epistemic taxonomies. Leon makes all of this much easier.
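Once hypotheses are coded in a schema like the one sketched above, a predictor search becomes a simple query over the metadata. Here is a minimal illustration using Python’s standard library; the article IDs, the second hypothesis, and the exact element names are all hypothetical placeholders, since the real schema is still under development:

```python
import xml.etree.ElementTree as ET

# Hypothetical coded metadata for two articles (IDs and schema names are illustrative).
CODED_ARTICLES = {
    "cameron2015": """
        <article>
          <hypothesis>
            <predictor>anticipated emotional exhaustion</predictor>
            <dv>dehumanization</dv>
          </hypothesis>
        </article>""",
    "other2014": """
        <article>
          <hypothesis>
            <predictor>mortality salience</predictor>
            <dv>worldview defense</dv>
          </hypothesis>
        </article>""",
}

def search_by_predictor(term):
    """Return IDs of articles whose coded hypotheses mention the given predictor."""
    hits = []
    for article_id, xml in CODED_ARTICLES.items():
        root = ET.fromstring(xml)
        for pred in root.iter("predictor"):
            if term in pred.text:
                hits.append(article_id)
    return hits

print(search_by_predictor("emotional exhaustion"))  # -> ['cameron2015']
```

A production system would of course back this with a proper search index rather than scanning XML strings, but the point stands: once the semantics are coded, the query is trivial.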
The above is a trivial example — I encourage you to think of the larger implications here. Think of the dimensions, the epistemic taxonomies, we could leverage. Even in this example, we can see some interesting potential dimensions. Like many social psychology hypotheses, there’s no sense of how common that predictor is — the spontaneous rate of anticipated emotional exhaustion in the relevant contexts. Nor is there any sense of the probability that a person who experiences this anticipated exhaustion will dehumanize the target individuals. (These issues are why I created Duarte’s q.)
And of course “dehumanization” is extremely abstract and vulnerable to the stipulations and values of any given research team — there are a number of dimensions we could apply to our analyses of highly abstract variables like these, and Leon makes it easier to work with them. (FYI, I don’t see the Cameron study as particularly problematic on any of these fronts — it’s pretty good.) Combined with the datasets, Leon also makes it easier to explore questions about real-world base rates and what we might term synthetic instantiation phenomena.
(If the above paragraph doesn’t make sense to you, be patient with me — I’ll unpack these ideas in upcoming work.)
With the data integration, Leon unlocks interesting possibilities and even new research methods. One easy win is that it gives us large minority samples. Now the seven African-Americans in one study, the five in another, and the eleven in each of eighteen other studies can be combined. There will be many datasets that include variables like the Big Five personality traits, self-esteem, SES, PANAS, and so forth. When multiple datasets include common variables, the datasets can be combined on those variables and the token minorities aggregated into usable samples. Now we can see if the same positive emotions do the same work, or relate to the same personality traits, in non-white samples. Now we can see a lot of things that we couldn’t see before.
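The pooling step itself is mechanically simple. A minimal sketch follows, in which the datasets, scores, variable names, and ethnicity coding are all invented for illustration — the real work lies in harmonizing measures across studies, not in the aggregation:

```python
# Each study's dataset is a list of participant records. Variable names
# ("ethnicity", "self_esteem") are hypothetical stand-ins for shared measures.
study_a = [
    {"ethnicity": "African-American", "self_esteem": 3.8},
    {"ethnicity": "White", "self_esteem": 4.1},
]
study_b = [
    {"ethnicity": "African-American", "self_esteem": 4.0},
    {"ethnicity": "Asian", "self_esteem": 3.9},
]

def pool_group_sample(datasets, group, common_vars):
    """Aggregate one group's participants across studies,
    keeping only the variables the datasets share."""
    pooled = []
    for data in datasets:
        for row in data:
            if row["ethnicity"] == group:
                pooled.append({v: row[v] for v in common_vars})
    return pooled

sample = pool_group_sample([study_a, study_b], "African-American", ["self_esteem"])
print(len(sample))  # -> 2: two token participants become one pooled sample
```

With Leon’s repository, the same loop runs over hundreds of datasets instead of two, and the token minorities become samples large enough to analyze.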
The win for minority research is a fraction of what Leon’s data aggregation enables. I mention the minority research angle because that was one of my goals in mapping out Leon. You should be thinking of a lot of other possibilities.
In concert with the data integration, Leon enables a whole new world of textual analysis for social science. Ultimately, Leon will have the materials and inductions for a vast number of studies — after all, this is rightly part of what we expect when we require that researchers post their “data.” There is a whole world of brilliant research out there focused on text analysis, lexical analysis, text corpora building, text mining, and so forth. Linguists are underappreciated geniuses. (Well, we could say this about most scientific fields.) A key goal for Leon is to spark interdisciplinary collaboration. Social psychology can be revolutionized by large scale textual analysis. (Note also the potential for text mining of the journal articles themselves.)
If you don’t know what I’m talking about here, just look into the relevant literature on textual analysis, ecological corpora building, and so on. These domains are probably unfamiliar to most social psychologists, but trust me, the work is not hard to grasp. What linguists and private-sector researchers have been doing is right up our alley.
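As a toy illustration of the kind of corpus work I mean: once study materials and inductions live in one repository, even the simplest frequency analysis runs across the whole field’s texts at once. The vignettes below are invented; real corpus methods go far beyond term counts:

```python
from collections import Counter
import re

# Hypothetical vignettes posted alongside two studies' data.
materials = [
    "Imagine you are about to meet a homeless man who needs help.",
    "Imagine you are about to meet a new coworker who needs help.",
]

def term_frequencies(texts):
    """Lowercase, tokenize, and count terms across a small corpus."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts

freqs = term_frequencies(materials)
print(freqs["imagine"], freqs["homeless"])  # -> 2 1
```

Linguists’ actual toolkits — collocation analysis, topic models, ecological corpora — would do far more with the same integrated materials.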
Bonuses: Seasonal and regional effects will become easier to detect. The oft-speculated end-of-semester participant effects. Weather effects. And lots of other effects I’m not going to go into now, but I’m extremely confident will explain variance. Latent class analyses that give us profiles of different types of participants will be much more powerful with Leon — LCA is a big advance over our standard correlation methods.
I spent several years in the software industry before entering social psychology. From a technical standpoint, Leon is surprisingly modest. It is a database, web server, and application server. Physically, it might require a dozen rack servers at most. The amount of data involved is large by social psychology standards, but quite modest by Big Data standards. In Stage 2, Leon will extend well beyond the features discussed here, and will likely require more compute capacity. For Stage 1, however, Leon can be deployed by three or four experienced web developers and one database analyst. This is not a Healthcare.gov scale project — it’s more like the Open Science Framework (and in fact, we should explore a collaboration with OSF.)
Some of the article metadata coding can be automated, but some will have to be manual. This will require a crowdsourced effort. I think the best approach would be to require graduate students to code a certain number of articles — e.g. code the hypotheses, the nature of the samples, etc. We are going to have to do this work at some point anyway — it’s surprising we haven’t coded and organized our knowledge base along these dimensions already. The field is already toying with the idea of requiring graduate students to attempt at least one replication of published studies in order to earn a doctorate. Requiring graduate students to code a few articles should be somewhat less controversial, and flows naturally out of the work they’re already doing in graduate seminars.
Potential funders include Templeton, the National Science Foundation, Google, and Microsoft. It would be a good idea to seek collaboration with the Open Science Framework, CurateScience.org, and similar efforts, as well as existing journal databases like PsycINFO. Specific funding needs will be sorted out by early 2016.
Leon is what social science circa 2015 should be. These are the tools we should have, and I think we’ll take all of this for granted in 2030. Scientific knowledge right now is fractured and unintegrated. Current methodological standards in social psychology are far too low to be able to generate credible claims about human behavior from typical standalone papers — integrating the literature and data will partly overcome this problem. Leon extends far beyond this by enabling the massive acceleration of substantive research into minority populations and countless novel hypotheses. As social psychology corrects its methodological issues, Leon becomes more powerful still.