Using Watson Debater to Measure Polarisation in Social Media Conversation
Social media is a platform for hosting a wide range of debates about popular topics, and it is of interest to measure the level of polarisation present in these debates. This post investigates how social media polarisation is influenced when activity by synthetic actors/bots is present. The dataset is restricted to a week's worth of data from the Twitter social media platform (16/07/2020 to 23/07/2020) and the focus is predominantly on the Black Lives Matter conversation. This topic was selected because of its contemporary relevance and its likelihood to be naturally polarising.
Our analysis aims to emulate a large part of the methodology that Gallacher and Heerdink (2019) adopt in their paper ‘Measuring the effect of Russian Internet Research Agency Information Operations in Online Conversations’. This paper provided both the inspiration and the theoretical framework for our work, and the reader should consult this excellent study for full details. We have made some adjustments to the methodology to account for our data limitations and to make use of Watson Debater’s Pro-Con service, which we believe is a good fit for this type of analysis.
IBM’s Project Debater is the first AI system that can debate humans on complex topics, with the first commercial release of Debater under the umbrella of other Watson AI services in March 2020. The full power of Watson Debater is to digest text pertaining to a particular topic, generate a clear and well-constructed argument for or against the topic and ultimately rebut its opponent. For the purpose of our analysis we only made use of its Pro-Con service, which scores text based on whether or not it is in favour of a topic. In this case we were scoring tweets on whether or not they were in support of the Black Lives Matter movement, with the ultimate goal of measuring polarisation in this conversation.
If you would like to learn more about Project Debater, which is truly the next milestone in IBM’s AI endeavours since the likes of Deep Blue and ‘Watson on Jeopardy!’, head to the link below.
The main audience for this work is data scientists who are looking to integrate NLP into their work for similar problems relating to user sentiment about a given topic. We also aim to demonstrate that the Watson Debater tool is robust and capable of reliably addressing this problem.
Original Paper Methodology
Before we can speak about the approach taken with our analysis, it will be instructive to summarise the methodology taken in the Gallacher and Heerdink paper — we can use this as a reference to explain the adjustments we have made.
All analysis performed by Gallacher and Heerdink is reported daily: therefore, all statistics reported are daily measures. This section gives a brief overview of the methodology — the reader should consult the paper for full details.
The data was collected from Twitter relating to the Black Lives Matter conversation. More precisely, if a tweet contained a ‘BlackLivesMatter’ or ‘BLM’ hashtag, it was considered relevant to the conversation. The objective of the Gallacher and Heerdink study was to test if Russian Troll Farm activity significantly affects the level of polarisation in a particular conversation. To determine the degree to which particular tweets express support for the Black Lives Matter movement, a technique called Correspondence Analysis (CA) was used.
Correspondence analysis is a geometric approach for visualising the rows and columns of a two-way contingency table as points in a low-dimensional space, such that the positions of the row and column points are consistent with their associations in the table. In this case the two-way contingency table is a retweet matrix. The CA returns principal component dimensions; the first dimension is constructed to explain the largest amount of variance in the data. The scores along the first principal component are taken to quantify the degree to which a tweet expresses sentiment supporting Black Lives Matter.
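To make the CA step concrete, here is a minimal numpy sketch of how first-dimension row scores can be extracted from a two-way contingency table. The retweet matrix below is made up purely for illustration; a library such as prince could be used instead of the hand-rolled SVD.

```python
import numpy as np

def ca_first_dimension(table):
    """Return first-dimension row scores from correspondence analysis
    of a two-way contingency table (e.g. a user-by-account retweet matrix)."""
    P = table / table.sum()                    # correspondence matrix
    r = P.sum(axis=1)                          # row masses
    c = P.sum(axis=0)                          # column masses
    # Standardised residuals: (P - r c') / sqrt(r c')
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    # Principal row coordinates on the first dimension
    return (U[:, 0] * s[0]) / np.sqrt(r)

# Toy retweet matrix: rows = users, columns = retweeted accounts
retweets = np.array([[5, 0, 1],
                     [4, 1, 0],
                     [0, 6, 2],
                     [1, 5, 3]], dtype=float)
scores = ca_first_dimension(retweets)
```

In this toy example the first two users mostly retweet one account and the last two another, so their first-dimension scores land on opposite sides of zero; the sign of each side is arbitrary, as it is for any SVD-based method.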
For any given user, the mean across all their tweet scores is taken to obtain a user-level score. The distribution of these user scores is what is analysed to assess the level of polarisation in the conversation. If the user score distribution is unimodal or mostly uniform, users are not clustering into extreme, non-overlapping camps. If the distribution is bimodal with very little overlap, opinions are forming in distinct camps, which is a sign of polarisation.
In order to test the modality of the distribution of user scores, Hartigan’s dip test is used, which measures the multimodality of a distribution. The test statistic, referred to as the D statistic, is operationalised to represent the level of polarisation by day. That is, for each day, a value of the D statistic is reported which represents conversation polarisation.
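The per-user aggregation feeding the dip test can be sketched with pandas. The column names and values below are illustrative only, and the dip statistic itself would come from a library such as the third-party diptest package (not used here, so the sketch stays self-contained):

```python
import pandas as pd

# Toy tweet-level data: one row per scored tweet (columns are illustrative)
tweets = pd.DataFrame({
    "user_id": ["a", "a", "b", "b", "c"],
    "day":     ["2020-07-16"] * 5,
    "pro_con": [0.8, 0.6, -0.7, -0.9, 0.1],
})

# User-level score: mean Pro-Con score over all of a user's tweets that day
user_scores = (tweets
               .groupby(["day", "user_id"])["pro_con"]
               .mean()
               .reset_index())

# For each day, the distribution of user_scores["pro_con"] is what the
# Hartigan dip test is applied to, e.g. with the third-party `diptest`
# package: dip, pval = diptest.diptest(day_scores.to_numpy())
```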
Making adjustments with Watson Debater
Our aim is to replicate the study by Gallacher and Heerdink but to use the Pro-Con service from Watson Debater to measure the degree to which a tweet agrees or disagrees with a topic of interest. We did this for two reasons:
1. There is no guarantee that the 1st Principal Component dimension from the Correspondence Analysis would give a reliable indication of the level of support or opposition for the Black Lives Matter movement indicated by the tweet. In our case, we performed Correspondence Analysis according to the procedure described in the paper, but we did not observe that the Correspondence Analysis scores indicated level of support. We found by manual inspection that extreme principal component scores did not correspond to tweets expressing extremely positive or negative sentiment.
2. The Pro-Con service is built for these types of problems and it lends itself naturally to this task.
The Pro-Con service with Watson Debater works in the following way — given a pair of strings <sentence; topic>, the Pro-Con service scores the extent to which sentence supports topic. The score is a real number ranging from -1, indicating that sentence most strongly opposes topic, to +1, indicating that sentence most strongly supports topic. Scores close to 0 indicate that the sentence is neutral with respect to topic. A positive score indicates that sentence is pro topic: the larger the positive score, the stronger the support that sentence provides to topic. A negative score indicates that sentence is con topic: the smaller the negative score (the more negative), the stronger the con aspect expressed in sentence toward topic.
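As a hedged illustration of how tweets might be paired with the topic and how the resulting scores can be read, here is a small sketch. The commented-out client calls reflect our recollection of IBM's early-access Python SDK and should be verified against its documentation; the neutral-band threshold is our own illustrative choice, not part of the service.

```python
# Hedged sketch: the client class/method names below are assumptions based on
# IBM's early-access Python SDK and should be checked against its docs.
# from debater_python_api.api.debater_api import DebaterApi
# debater_api = DebaterApi("YOUR_API_KEY")
# pro_con_client = debater_api.get_pro_con_client()

TOPIC = "Support Black Lives Matter"

def make_payload(sentences, topic=TOPIC):
    """Pair each sentence with the topic, as the Pro-Con service expects."""
    return [{"sentence": s, "topic": topic} for s in sentences]

def interpret(score, neutral_band=0.2):
    """Map a Pro-Con score in [-1, 1] to a coarse stance label.
    The neutral-band width is an illustrative choice, not part of the service."""
    if score > neutral_band:
        return "pro"
    if score < -neutral_band:
        return "con"
    return "neutral"

payload = make_payload(["Black lives matter, full stop.",
                        "All this BLM noise needs to end."])
# scores = pro_con_client.run(payload)  # one score in [-1, 1] per pair
```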
Unlike the study, we did not have access to accounts that were confirmed as trolls. Therefore, we decided to explore whether accounts classified as bots affect the daily level of polarisation within a conversation. To identify which accounts in our dataset were bots, we used tweetbotornot, a pre-trained machine-learning classifier for detecting Twitter bots.
We also have data limitations compared to the paper and therefore our analysis should not be seen as robust or representative — rather, the objective is to demonstrate the methodology for measuring polarisation using Watson Debater. With a larger dataset, we could apply the pipeline described here to measure the connection between conversation polarisation and trolls or bots (however they’re defined).
Methodology and Results
The approach we took was consistent with that of Gallacher and Heerdink, with the notable exceptions already mentioned. For completeness we summarise our workflow in this section and present some results. First, we load Twitter data into a database; the data is composed of two parts:
1. Tweets relating to hashtags of interest
2. User table containing results from the pre-trained bot classifier. This determines if a given Twitter user is considered a bot or not.
Subsequently, we join the tables so that we have a single table where each row represents a tweet, with an indicator variable signalling whether the user is a bot.
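The join can be sketched with pandas; the schemas below are illustrative, not the real table definitions:

```python
import pandas as pd

# Illustrative tweet table: one row per tweet
tweets = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "user_id":  ["a", "b", "a"],
    "text":     ["#BlackLivesMatter ...", "#BLM ...", "..."],
})

# Illustrative user table with the bot-classifier output
users = pd.DataFrame({
    "user_id": ["a", "b"],
    "is_bot":  [False, True],
})

# One row per tweet, with the bot indicator carried over from the user table
tweets_with_bots = tweets.merge(users, on="user_id", how="left")
```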
The text for each tweet must be cleaned to prepare the tweets to be fed to the Watson Debater Pro-Con service. Once prepared, we score the tweets in the dataset using the Debater Pro-Con score. Since the service takes (sentence, topic) as an input pair, we chose “Support Black Lives Matter” as the topic. The Pro-Con service will output a score between -1 and 1 indicating the extent to which a given tweet supports the topic “Support Black Lives Matter”.
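A minimal sketch of the kind of cleaning involved; a real pipeline will need more care with emoji, truncation and language detection:

```python
import re

def clean_tweet(text):
    """Illustrative cleaning before sending text to the Pro-Con service."""
    text = re.sub(r"http\S+", "", text)        # drop URLs
    text = re.sub(r"@\w+", "", text)           # drop @mentions
    text = re.sub(r"#", "", text)              # keep hashtag words, drop '#'
    text = re.sub(r"RT\s+", "", text)          # drop retweet marker
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace

cleaned = clean_tweet("RT @user: #BlackLivesMatter is important https://t.co/x")
```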
We take the mean of the Pro-Con scores for all the tweets belonging to a given user to obtain a user-level score. The distribution of these scores is what is used to compute the level of polarisation in the conversation. We compute Hartigan’s dip test and report the test statistic as the level of polarisation per day. Finally, we plot the daily polarisation level against the number of bots active on the same day. If we had more data, we could replicate the original study’s extensive permutation testing to test the relationship between number of bots and polarisation. We could also have experimented with Granger causality and causal time-series approaches. The objective with these approaches would be to explore in a more systematic way whether there is a correlation between polarisation and the number of bots.
The above plots show the output of the 7-day analysis. When we look at the Pro-Con user score distribution, we see that it is unimodal around zero and therefore the conversation is not polarised. If we saw distinct bumps at the extremes of the x-axis, we would have a strong indication that the users are expressing polarised sentiment. The line graphs show the daily level of polarisation and number of bots — there seems to be no relationship between the variables, but a larger sample of data taken at appropriate times may display a pattern between the variables.
We also considered the effect of bot activity across different conversations (represented by different hashtags). We used the Jensen-Shannon (JS) divergence between the Pro-Con score distribution generated by tweets from bots and the Pro-Con score distribution generated by tweets from non-bots. To do this we separated the Pro-Con user distributions, making distinctions between users who are bots and those who are not. We reported the daily JS divergence score to get an indication of whether the divergence was of interest. Since there is no single threshold for this, the analyst must decide if the divergence is large enough to warrant further investigation.
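A sketch of the JS divergence computation, assuming the two score samples are first binned into a shared histogram over [-1, 1]; note that SciPy's jensenshannon returns the JS distance (the square root of the divergence), hence the squaring:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def js_divergence(bot_scores, human_scores, bins=20):
    """Jensen-Shannon divergence between the Pro-Con score distributions
    of bot and non-bot users, using a shared histogram over [-1, 1]."""
    edges = np.linspace(-1, 1, bins + 1)
    p, _ = np.histogram(bot_scores, bins=edges)
    q, _ = np.histogram(human_scores, bins=edges)
    p = p / p.sum()
    q = q / q.sum()
    # scipy returns the JS *distance* (square root of the divergence)
    return jensenshannon(p, q, base=2) ** 2

rng = np.random.default_rng(0)
bot_sample = rng.uniform(-1, 1, 500)
human_sample = rng.uniform(-1, 1, 500)
daily_js = js_divergence(bot_sample, human_sample)
```

With base-2 logarithms the divergence is bounded in [0, 1], which makes the daily values comparable across conversations of different sizes.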
The table above compares the “BlackLivesMatter” and “5G” conversations by looking at the Debater Pro-Con score distributions. Values close to 0 indicate that there is little difference between the distributions for bots and non-bots — this test becomes important when large divergences are observed. Large divergences warrant a deeper investigation into the types of messages bot accounts are tweeting and retweeting versus authentic (non-bot) users.
As the influence of social media continues to rise, it is important to develop a set of tools and methodologies which allow us to analyse how users are interacting online. Polarisation of conversations online is a fascinating phenomenon to study because it can yield insights about public opinion around a topic of interest. Gallacher and Heerdink offer a methodology that can be adopted to obtain an understanding of conversation polarisation. Combining this approach with a powerful purpose-built service like Watson Debater’s Pro-Con allows the user to generate relevant and robust statistical distributions which can be examined for deeper insight. Enterprises could use this type of analysis for a variety of purposes, such as measuring the adoption of corporate strategies by employees or customer discussion around their products. The Debater Pro-Con service lends itself naturally to analysis of this type, and this post has only begun to describe its far-reaching capability and utility.
Appendix A: Areas for Further Development
We have identified some key areas for further development consistent with the methodology described in this post:
1. The data volume has been the biggest challenge in this work. If we could obtain a larger data volume for a significantly long period of time, we could see more interesting and meaningful results. This is a perennial challenge when using the Twitter API.
2. Investigation of methods that can measure the link between the number of bots and the level of polarisation. There are many approaches we could have looked into if there was more data. The Gallacher and Heerdink paper performed permutation testing with rolling 7-day averages across a 20-day post period. Granger causality, intervention analysis and other time-series methods could also have been investigated to explore the nature of the relationship between conversation polarisation and the level of bot activity.
3. Explore different statistical methods to measure polarisation — the Hartigan’s dip test measures multimodality, not strictly bimodality. Since the Debater Pro-Con score distribution is bounded between -1 and 1, it would be interesting to explore polarisation methods which utilise this property.
4. Explore different divergence metrics when comparing bot and non-bot distributions across different conversations. JS divergence has the benefit of being symmetric, but many other distance metrics exist and a study comparing different metrics would be fruitful.
References
1. Gallacher, J., & Heerdink, M. (2019). Measuring the effect of Russian Internet Research Agency information operations in online conversations. Defence Strategic Communications, 6, 155–198.
2. Bot classifier details: https://github.com/mkearney/tweetbotornot