Linguistic Analysis of Twitter Posts Can Predict Heart Disease

When you connect your social media accounts to the first version of the Sherbit beta, you’ll be able to see a few data points describing your online activity, like a graph of ‘Likes’ on your status updates over time, or a map of your geo-tagged Instagram photos. With even this limited metadata, you can glean some useful insights about your social media presence — like what time of day your posts receive the most comments or ‘Retweets.’ For the time being, we’re restricted to these ‘active’ data points because that’s the only information available to us through the Facebook and Twitter developer APIs.

By the time we open the application to the public, we hope to have formed partnerships with these social media sites to help you access more detailed information about your ‘passive’ behaviors, such as the amount of time you spend browsing the News Feed or using the Messenger app. With this kind of metadata in hand, researchers have been able to draw interesting conclusions about health risks associated with social media use; for example, behavioral scientists have found that ‘lurking’ on Facebook is strongly correlated with depression and deterioration in mood.

However, a recent study shows that even more powerful inferences about your health can be made from the actual content of your social media posts. Researchers at the University of Pennsylvania analyzed the language of public Twitter posts to make inferences about negative emotions and social relationships — major risk factors for coronary heart disease. Using a cross-sectional regression analysis that looked exclusively at language patterns in social media data, the scientists were able to predict mortality rates from atherosclerotic heart disease (AHD) with greater accuracy than models that looked at demographic information and common risk factors like smoking and obesity.

In 2009 and 2010, Twitter made a random sample of 10% of all tweets available to academics. The researchers at the University of Pennsylvania used self-reported location information from public Twitter profiles to map these tweets to specific counties. The Centers for Disease Control and Prevention provides access to heart disease mortality rates at the county level; the researchers averaged the rates from 2009 and 2010 to match the time period of the dataset they received from Twitter. Finally, they added demographic and health risk information obtained from the U.S. Census Bureau to determine each county’s average income and graduation rate, distribution of gender and ethnicity, and prevalence of diabetes, obesity, smoking, and hypertension. The study analyzed more than one thousand U.S. counties, each with at least 50,000 tweeted words available; together these counties are home to more than 88% of the U.S. population.
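To make the data preparation described above concrete, here is a minimal Python sketch of two of its steps: keeping only counties with enough tweeted words, and averaging two years of mortality rates. This is an illustration under stated assumptions, not the researchers' actual pipeline. The function names, the `(county, text)` input format, and the toy numbers are all hypothetical; only the 50,000-word threshold comes from the study as described here.

```python
from collections import defaultdict

MIN_WORDS = 50_000  # threshold quoted in the article


def words_per_county(tweets):
    """Sum whitespace-separated word counts per county.

    `tweets` is an iterable of (county, text) pairs; this input
    format is an assumption made for the sketch.
    """
    totals = defaultdict(int)
    for county, text in tweets:
        totals[county] += len(text.split())
    return dict(totals)


def eligible_counties(word_totals, min_words=MIN_WORDS):
    """Return the set of counties with at least `min_words` words."""
    return {c for c, n in word_totals.items() if n >= min_words}


def average_mortality(rates_2009, rates_2010):
    """Average the two yearly rates for counties present in both years."""
    shared = rates_2009.keys() & rates_2010.keys()
    return {c: (rates_2009[c] + rates_2010[c]) / 2 for c in shared}


# Toy data (invented), with a tiny threshold so the example is readable.
toy_tweets = [
    ("County A", "so tired of this traffic"),
    ("County B", "great day with friends"),
]
totals = words_per_county(toy_tweets)
print(eligible_counties(totals, min_words=5))
print(average_mortality({"County A": 50.0, "County B": 40.0},
                        {"County A": 54.0}))
```

Counting words by splitting on whitespace is a deliberate simplification; a real analysis of tweets would need a tokenizer that handles hashtags, mentions, links, and emoticons.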

The study used an algorithm to determine how frequently words and phrases appear. The researchers then calculated the relative frequencies of psychologically related words and topics (phrases indicating anger, anxiety, positive and negative emotions, positive and negative relationships, and so on) and had human raters evaluate a sample of the tweets to make sure that ambiguous words (for example, words used ironically) were accurately linked to the intended psychological concept. Finally, they built a predictive model of mortality rates from each of the available sets of predictors (psychologically related words, demographic characteristics, health risk factors, and so on) to compare how accurately each type of data predicted mortality. The results are telling.
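Before the results, the relative-frequency step described above can be made concrete with a minimal Python sketch. It computes, for a batch of texts, the share of all words that fall into each psychological category. The tiny `LEXICON`, the function names, and the sample texts are hypothetical and chosen only for illustration; the study's own dictionaries and topics are much larger, and nothing here reproduces its actual method.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon for illustration only.
LEXICON = {
    "anger": {"hate", "angry", "furious"},
    "positive_emotion": {"great", "happy", "wonderful"},
}


def tokenize(text):
    """Lowercase the text and extract runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def category_frequencies(texts, lexicon=LEXICON):
    """Return each category's share of all tokens across `texts`."""
    counts = Counter()
    total = 0
    for text in texts:
        for token in tokenize(text):
            total += 1
            for category, words in lexicon.items():
                if token in words:
                    counts[category] += 1
    if total == 0:
        # No words at all: report zero for every category.
        return {category: 0.0 for category in lexicon}
    return {category: counts[category] / total for category in lexicon}


sample = ["I hate this traffic", "Great day, so happy"]
print(category_frequencies(sample))
# anger: 1 of 8 tokens = 0.125; positive_emotion: 2 of 8 tokens = 0.25
```

A plain word match like this cannot tell sincere from ironic usage, which is why a human check of the kind the researchers describe matters.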

The scientists concluded that negative-emotion and negative-relationship language was strongly correlated with increased risk of heart disease, whereas positive-emotion language was ‘protective’ against it. If they can be replicated, these results offer more evidence that analysis of social media content could provide useful epidemiological insights, even teaching us about psychological health at a large scale. This conclusion isn’t immediately intuitive — after all, as the study points out, “the typical Twitter user is younger than the typical person at risk for [heart disease]… The people tweeting are not the people dying.” The researchers suggested that the social media activity of young people must therefore strongly indicate general characteristics of the communities they live in — their shared norms and attitudes, economic conditions, and ‘psychological environments.’

Our goal is to develop technology for understanding your personal data, and to build our analytics tools we are drawing on insights from large-scale studies like the one performed at the University of Pennsylvania. For the first few months of the closed beta, we will not let you store and analyze the actual content of your tweets; we’ll just be testing our data visualization algorithms, to give you some interesting ways to look at and play with your metadata. But down the line, we plan to make this kind of sophisticated health analysis tool available to you: in the future, we want you to be able to use Sherbit to graph your mood over time, based on inferences made from your social media activity. It’ll take us some time to get there, but we’re excited about the possibilities. Sign up for our waitlist to stay updated on our progress!
