Explore Patterns of Vaccination Influences in Social Media, Part 1 (Time Series Analysis)

Apr 29, 2019

Background

Vaccination programs generally need detailed information about a population’s vaccine-related beliefs and behaviors. Understanding why people comply with or refuse vaccination is particularly important for building effective health communication systems. Traditionally, this information is collected through telephone surveys and panels. However, those methods are too slow to provide real-time information and can underrepresent young and urban participants and minorities. Real-time social media such as Twitter offers an alternative. In this project, we try to find an effective, real-time way for vaccination programs to collect wide-ranging information about vaccination behavior.

Data

The project’s data come from two sources: official US Centers for Disease Control and Prevention (CDC) data on influenza vaccination, and tweets containing keywords such as “flu” and “vaccination”, collected from the Twitter Streaming API. In part one, I worked on preprocessing both sources so they could be used for later analysis and as training data for a machine learning classifier. The CDC data includes vaccination coverage by month, by geographic region as defined by the US Department of Health and Human Services (HHS), and by demographic group.

Data Preprocessing

The original Twitter data was provided by Xiaolei, and my job was to clean it. The raw data was messy when I received it. Xiaolei asked me to apply the following cleaning steps:

1. Hyperlinks, hashtags, and mentioned users were replaced by <url>, <hashtag>, and <user>, respectively.

2. Repeated punctuation marks were replaced by “punctuation <repeat>”.

3. Each tweet was tokenized with NLTK (Bird et al., 2009).

4. Words in each tweet were lowercased.
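The four steps above can be sketched in a few lines of Python. This is only an illustration, not the script used in the project: the regular expressions and the function name `clean_tweet` are my own assumptions, and a simple regex tokenizer stands in for NLTK so that the example runs with the standard library alone.

```python
import re

# All patterns below are illustrative assumptions, not the project's actual rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
USER_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
# Two or more of the same punctuation mark in a row, e.g. "!!!" or "??".
REPEAT_RE = re.compile(r"([!?.,])\1+")
# Stand-in tokenizer (NOT NLTK): placeholder tags stay whole, then words,
# then any single non-space symbol.
TOKEN_RE = re.compile(r"<\w+>|\w+(?:'\w+)*|[^\w\s]")


def clean_tweet(text):
    """Apply steps 1-4 and return a list of lowercase tokens."""
    text = URL_RE.sub(" <url> ", text)            # step 1: hyperlinks
    text = USER_RE.sub(" <user> ", text)          # step 1: mentioned users
    text = HASHTAG_RE.sub(" <hashtag> ", text)    # step 1: hashtags
    text = REPEAT_RE.sub(r" \1 <repeat> ", text)  # step 2: repeated punctuation
    return TOKEN_RE.findall(text.lower())         # steps 3 and 4


print(clean_tweet("Got my FLU shot!!! @CDCgov #flu http://t.co/abc"))
# ['got', 'my', 'flu', 'shot', '!', '<repeat>', '<user>', '<hashtag>', '<url>']
```

URLs are replaced first so that a "#" or "@" inside a link is not mistaken for a hashtag or a mention.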

The original data contained 1,007,582 tweets, but I only needed to clean 10,000 of them for the next step, which is still a large amount of data. The cleaned data was organized into the following columns:

Twitter_Id, Twitter_content, relevant, intent, intent_what, intent_who, sentiment. Here is an example:

Twitter_Id: 2.7758821473323E+017

Twitter_content: I get my flu shot on December 21st A FLU SHOT AIN’T GON SAVE ME FROM THE APOCALYPSE MUTHAFUCKA

Relevant: yes

Intent: yes

Intent_what: yes_received

Intent_who: [‘yes_author’]

Sentiment: negative

The “intent” column is the most important one for training in the next step. Cleaning, annotating, and organizing the data is time-consuming, so we used Amazon Mechanical Turk to help annotate it. We randomly picked 10,000 tweets, and each tweet was annotated by 3 annotators. We then discarded 3 low-quality annotators, took the majority vote for each tweet, and assembled the results into the training data.
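As a rough illustration of the majority-vote step, here is a minimal Python sketch. The function name, the annotator IDs, and the rule of returning `None` on a tie are assumptions made for this example; the post does not say how ties or low-quality annotators were actually handled.

```python
from collections import Counter


def majority_label(annotations, excluded_annotators=frozenset()):
    """Return the most common label among (annotator_id, label) pairs.

    Votes from annotators in `excluded_annotators` are ignored.
    Returns None when no votes remain or when the top two labels tie
    (an assumed convention for this sketch).
    """
    votes = Counter(
        label for annotator, label in annotations
        if annotator not in excluded_annotators
    )
    if not votes:
        return None
    ranked = votes.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # tie: no clear majority
    return ranked[0][0]


# Hypothetical annotations for one tweet's "intent" column.
tweet_votes = [("a1", "yes"), ("a2", "yes"), ("a3", "no")]
print(majority_label(tweet_votes))  # yes
```

With three annotators and a yes/no label there is always a majority, but once an annotator is excluded a tie becomes possible, which is why the sketch handles that case explicitly.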

Analysis

For the next step, I used the TfidfVectorizer from scikit-learn to vectorize the tweets, using bigram features. (A bigram is a sequence of two adjacent elements from a string of tokens, which are typically letters, syllables, or words; it is an n-gram with n = 2.) I then fed the training data into a logistic regression model.
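To make the bigram and TF-IDF ideas concrete, here is a small pure-Python sketch. It is a simplified stand-in for scikit-learn's TfidfVectorizer, not the code used in the project: the function names are hypothetical, combining unigrams with bigrams is an assumption, and the weighting (smoothed IDF plus L2 normalization) is one common variant. The logistic regression step is not shown.

```python
import math
from collections import Counter


def ngrams(tokens, n_min=1, n_max=2):
    """Return all n-grams of length n_min..n_max as space-joined strings."""
    out = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out


def tfidf(docs):
    """Turn tokenized documents into L2-normalized TF-IDF dictionaries."""
    features = [Counter(ngrams(doc)) for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each feature appears.
    df = Counter(term for counts in features for term in counts)
    vectors = []
    for counts in features:
        # Smoothed IDF, so a feature present in every document still gets weight 1.
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


print(ngrams(["got", "my", "flu", "shot"]))
# ['got', 'my', 'flu', 'shot', 'got my', 'my flu', 'flu shot']
```

Features that appear in many tweets (such as "flu") end up with lower weights than features that are specific to a few tweets, which is the point of the IDF term.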

Pearson correlation between the Twitter counts and the CDC data vs. the LR classifier (a higher F1 score tends to go with a higher Pearson correlation, but not always)

F1-score (Logistic Regression)

Receipt/intent vs other (intent): 0.82

(The F1 score is the harmonic mean of precision and recall; it reaches its best value at 1 (perfect precision and recall) and its worst at 0.)
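For readers who want to see the definition in action, the following sketch computes F1 for a binary task by hand. The labels and predictions are made-up toy values, not results from this project, and the function name is hypothetical.

```python
def f1_score(y_true, y_pred, positive="intent"):
    """Harmonic mean of precision and recall for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example: 2 true positives, 1 false positive, 1 false negative.
y_true = ["intent", "intent", "intent", "other", "other"]
y_pred = ["intent", "intent", "other", "intent", "other"]
score = f1_score(y_true, y_pred)  # precision = recall = 2/3, so F1 = 2/3
```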

To analyze vaccination patterns as time series:

a. I counted both the weekly and monthly numbers of tweets classified as “intention/receipt”. The Twitter and CDC data were each normalized by z-score separately.

b. I then ran an autoregressive integrated moving average (ARIMA) time series model. The results suggested a linear relationship between the CDC and Twitter trends, so we fitted a linear regression model that uses the Twitter trend to predict the CDC trend.

c. Additionally, I calculated the Pearson correlation between the Twitter counts and the CDC data.
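The counting, z-score normalization, linear fit, and Pearson correlation described in these steps can be sketched with the standard library. The function names and all numbers below are invented for illustration and are not the project's data. The ARIMA step is not shown.

```python
from collections import Counter
from datetime import date
from math import sqrt


def monthly_counts(dates):
    """Count items per (year, month), in chronological order."""
    counts = Counter((d.year, d.month) for d in dates)
    return dict(sorted(counts.items()))


def zscore(values):
    """Subtract the mean and divide by the (population) standard deviation.

    Assumes the values are not all identical, otherwise the deviation is 0.
    """
    mean = sum(values) / len(values)
    std = sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


def pearson(x, y):
    """Pearson correlation as the mean product of the two z-scored series."""
    return sum(a * b for a, b in zip(zscore(x), zscore(y))) / len(x)


def fit_line(x, y):
    """Ordinary least squares for y ~ slope * x + intercept."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx


# Made-up monthly values, for illustration only.
twitter = [120.0, 340.0, 560.0, 210.0]  # tweets classified as intention/receipt
cdc = [10.0, 25.0, 41.0, 16.0]          # vaccination coverage

# Normalize each series separately, then predict CDC from Twitter.
slope, intercept = fit_line(zscore(twitter), zscore(cdc))
r = pearson(twitter, cdc)
```

When both series are z-scored, the fitted slope equals the Pearson correlation, which is why the two measures tell a consistent story about how closely the Twitter trend tracks the CDC trend.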

The figure shows the LR time series alongside the CDC data. There are only minor differences between the trends of the two series. Notice that each peak usually falls in October of the flu season. However, there is a distinct peak between Jan. 2014 and Feb. 2014, which might indicate that many people were also talking about getting flu shots during that time.

Forecasting

Figure 4. Estimates of the trend and the annual cyclical pattern.

Here are the predictions made by Prophet. Figure 3 includes the historical data (black points), the line connecting these points, and the forecast into the future with its error range. Figure 4 includes the estimates of the trend and the annual cyclical pattern. Both figures show an increasing trend in vaccination receipt.

Thanks to Xiaolei Huang and Michael J. Paul for their help on this project.

Reference

Examining Patterns of Influenza Vaccination in Social Media

Xiaolei Huang, Michael C. Smith, Michael J. Paul, Dmytro Ryzhkov, Sandra C. Quinn, David A. Broniatowski, Mark Dredze