Social Media Intelligence Case Study: An OSINT Data Engineer’s Investigation into American Right-Wing Communities Targeted by Russian Disinformation

Eric Brichetto
Jun 16, 2023


In the digital age, the battle for hearts and minds is often fought on the front lines of social media. This article is a deep dive into a personal project that examined the dissemination of disinformation within the American right-wing online ecosystem. The disinformation in question traces back to two American sources: Patrick Lancaster, a self-styled journalist, and Matt Couch, the owner and primary author of thedcpatriot.com, a notorious far-right trash news website.

Introduction

Patrick Lancaster

Patrick Lancaster is an American self-styled journalist and vlogger who has been reporting from the front lines of the Russian invasions of Ukraine since 2014. He rose to some prominence in the early weeks of the full-scale Russo-Ukrainian War, and he has been assessed to be a disinformation actor by multiple independent groups and experts, including Bellingcat and the Institute for Strategic Dialogue.

Matt Couch

Matt Couch is an American media entrepreneur, founder of America First Media, and chief operator of the low-quality “trash news” website The DC Patriot (thedcpatriot[dot]com). Media Bias Fact Check has rated his The DC Patriot website as “Far-Right Biased and Questionable based on the promotion of conspiracy theories and propaganda and the use of poor sources, a lack of transparency, and false claims” (https://mediabiasfactcheck.com/the-dc-patriot/, emphasis in original).

Disinformation Content

In early April 2022, Couch recorded a video interview with Lancaster in which Lancaster claimed that Ukrainian civilians in Mariupol had been attacked not by Russian military forces, but by Ukrainian troops of the Azov Battalion, a false narrative that was offered without evidence and has since been repeatedly debunked. This video interview provided the source material for an article published on April 4, 2022 on The DC Patriot parroting the same claim (https://thedcpatriot[dot]com/watch-ukrainians-tell-reporter-on-ground-in-mariupol-its-not-russians-its-ukrainians-azov-attacking-civilians-patrick-lancaster-news-reports/).

Share and interaction data for that article, collected on April 26, 2022 via the free CrowdTangle Chrome browser extension and still publicly available, indicated that the article had been shared on Facebook into three notable groups:

PRESIDENT DONALD TRUMP (20k+ followers), https://www.facebook.com/groups/515283525287467/permalink/2302022263280242

Woo the People Group (3.2k followers), https://www.facebook.com/groups/355954155643336/permalink/691700722068676

RED WHITE, and WE THE PEOPLE back the BLUE & PRESIDENT TRUMP (4.3k followers), https://www.facebook.com/groups/3838333272856171/permalink/5641395732549907

This article (as well as others from The DC Patriot) was shared into these groups by a Facebook account under the name Victoria Kentz. The account lists no identifying information, is highly anonymous in what little it does disclose, and its activity, as far as I could find, consists only of changing its profile picture and posting at high frequency into groups such as those above. These characteristics are strongly suggestive of either a compromised or an outright fake account. If so, then there is a question of whether the activity outlined above constituted coordinated inauthentic behavior.

The Investigation Goals

The central question driving this project is whether the dissemination of content from Lancaster and Matt Couch to three chosen Facebook groups meets the criteria and characteristics of an influence operation. More specifically, does it exhibit signs of coordinated inauthentic behavior? Does it violate the Dangerous Organizations and Individuals community standard on Facebook? If so, how can we characterize and assess the operation using the SCOTCH framework, and what can we learn about the dynamics of suspected inauthentic behavior in these Facebook groups?

As a sub-goal of this project, I sought to analyze the high-level demographics, engagement metrics, and basic inter-group dynamics of the three Facebook groups targeted by false news articles produced by Matt Couch, including their geographic distribution. This sub-goal grew into a case study of analysis and data engineering equal in size and scope to my original plan for the entire project, and it is the focus of this article. My hope is that this report will serve as a valuable case study in blending open-source intelligence (OSINT), social media intelligence (SOCMINT), and data engineering to enable real-world investigations.

In the following sections, I will delve deeper into these questions, dissecting the complex web of social media disinformation and the (surprising) makeup of its target audiences.

The code shared below can be viewed in my corresponding GitHub repository.

The Investigation Blueprint: A Journey into the Heart of Disinformation

Every investigation begins with a plan, a roadmap that guides the researcher through the labyrinth of data and information. In this case, the project plan was meticulously designed to dissect the complex web of disinformation spread by Patrick Lancaster and Matt Couch.

The first step was to define the problem statement, a crucial aspect that set the direction for the entire investigation. The goal was to determine whether the content disseminated by Lancaster and Couch to three specific Facebook groups exhibited signs of an influence operation, coordinated inauthentic behavior, and violations of Facebook’s community standards.

Data gathering was the next step, a process that involved sourcing information from various channels. These included Lancaster’s own platforms, such as his website and YouTube channel, and thedcpatriot.com, Matt Couch’s trash news website. The content from these sources, along with the member lists of the three targeted Facebook groups, formed the core data for the investigation.

Data cleaning followed, ensuring that the information was accurate, consistent, and usable for the subsequent stages of the project. The data enrichment phase, however, proved to be a significant challenge. The plan was to extract the claimed geographic locations from all Facebook group member accounts and label user accounts that appeared to be bot, sockpuppet, and/or hijacked accounts. While the geographic extraction was successful, the labeling of potentially inauthentic accounts was a hurdle that could not be overcome within the project’s timeframe.

Despite this, the data collected and processed provided a rich foundation for the planned descriptive and diagnostic analysis. The aim was to understand the objectives of the disinformation campaign, assess its success or failure, and create visualizations to illustrate the findings. An optional step was to attempt an attribution, identifying the possible actors behind the campaign.

While some of these steps have been left for future investigations, the project still yielded valuable insights into social mechanics of disinformation within the American right-wing online ecosystem.

The SCOTCH Framework: A Tool for Dissecting Influence Operations

In the realm of influence operations, understanding the intricacies of a campaign is crucial. One tool that can aid in this process is the SCOTCH Framework, developed by the Atlantic Council. This framework provides a comprehensive yet succinct way to characterize influence operations, making it an invaluable tool for analysts, researchers, and policymakers.

The SCOTCH Framework is an acronym that stands for:

  • Source: The origin of a campaign, which could be individuals, bots, or even the platform itself.
  • Channel: The platform and its features used to disseminate the campaign.
  • Objective: The goal of the operation, which may be obfuscated and require careful inference.
  • Target: The intended audience of the campaign, which can be defined demographically, geographically, or by shared beliefs.
  • Composition: The specific language or image content used in the operation.
  • Hook: The technical mechanisms of exploitation and the psychological phenomena being exploited.

In the context of this investigation, the SCOTCH Framework was intended to be a guide for characterizing the dissemination of content from Patrick Lancaster and Matt Couch to three chosen Facebook groups. In the context of our case study into the group dynamics of the targeted groups, it informed the breakdown of data sources to collect from and provided a structured approach to the investigation.

For example, the ‘Source’ would be Lancaster and Couch, the ‘Channel’ would be their respective platforms and the suspected hijacked account amplifying their content into the Facebook groups, and the ‘Target’ would be the members of these groups. The ‘Objective’, ‘Composition’, and ‘Hook’ would be analyzed based on the content and tactics used.
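To make that mapping concrete, here is a minimal sketch, not part of the project code, of how a SCOTCH characterization for this case could be captured as a simple Python structure so that distinct operations can be compared side by side; the values simply restate the provisional assessment above:

# Provisional SCOTCH characterization for this case, recorded as plain data.
# 'objective', 'composition', and 'hook' remain open questions pending further analysis.
scotch_characterization = {
    "source": ["Patrick Lancaster", "Matt Couch (The DC Patriot)"],
    "channel": [
        "Lancaster's and Couch's own platforms",
        "suspected hijacked/fake account sharing into Facebook groups",
    ],
    "objective": None,      # to be inferred from content and tactics
    "target": ["members of the three targeted Facebook groups"],
    "composition": None,    # specific language/image content, to be analyzed
    "hook": None,           # exploitation mechanisms, to be analyzed
}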

Despite the challenges encountered, the SCOTCH Framework remains a valuable tool for understanding and dissecting influence operations. It provides a comprehensive view of an operation, allowing for easy comparison between distinct operations and offering decision-makers the information needed to understand an operation and take appropriate counteractions.

Behind the Scenes: Data Scraping and Transformation

In the world of data analysis, the process of gathering and preparing data is often the most time-consuming and challenging part of a project. This investigation was no exception. To scrape the list of members in the Facebook groups, I utilized Burp Suite Proxy Interceptor to scroll through the full member list in each group and save the Facebook GraphQL API responses to local files in XML format.

Once the data was collected, the next step was to parse this GraphQL XML data and transform its contents into a more usable format. For this, I wrote a Python script that converted the data into JSONL (line-delimited JSON) files. This script was designed to decode the base64 encoded responses, extract the relevant data, and write it to a new file in the JSONL format.

Here is the Python script used for this process:

import json
import base64
import argparse


def parse_responses(source_file: str, dest_file: str, mode: str):
    # For 'json' mode, accumulate simplified member records under a key derived from the output filename.
    arr = {f'{dest_file.split(".")[0]}': []} if mode == 'json' else None
    decoded_resp_arr = []

    with open(source_file, 'r') as f:
        for line in f.readlines():
            # Only process non-empty base64-encoded <response> elements from the Burp Suite XML export.
            if line.strip().startswith('<response base64="true">') and not line.strip().startswith('<response base64="true"></response'):
                try:
                    resp = line.strip().removeprefix('<response base64="true"><![CDATA[').split("]]><")
                    resp_line = base64.b64decode(resp[0]).decode('utf-8')
                    # The decoded HTTP response separates headers from body with a blank line.
                    resp_splitted = resp_line.split("\r\n\r\n")

                    for line2 in resp_splitted:
                        if line2.startswith('{"data":'):
                            payload = line2.split("]]><")
                            try:
                                if mode == 'jsonl':
                                    decoded_resp_arr.append(payload[0] + '\n')
                                elif mode == 'json':
                                    # Note: some response shapes expose members under ['group_member_discovery']
                                    # or ['comet_hovercard_renderer'] instead of ['new_forum_members'].
                                    a = json.loads(payload[0])
                                    for i in a['data']['node']['new_forum_members']['edges']:
                                        _ = [x['text'] for x in i['node']['bio_text'] if x is not None]
                                        description = _[0] if _ else None
                                        arr[f'{dest_file.split(".")[0]}'].append({
                                            "name": i['node']['name'],
                                            "link": i['node']['url'],
                                            "profile_pic": i['node']['profile_picture']['uri'],
                                            "joined": i['membership']['join_status_text']['text'],
                                            "description": description
                                        })
                            except Exception as e:
                                print(e)
                except Exception as ex:
                    print(ex)

    if mode == 'jsonl':
        with open(dest_file, 'w') as f:
            for item in decoded_resp_arr:
                f.write(item)
    elif mode == 'json':
        with open(dest_file, 'w') as f:
            json.dump(arr, f, indent=4)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Parse Burp Suite GraphQL response exports into JSON/JSONL.')
    parser.add_argument('--source_file', type=str, help='Source file path')
    parser.add_argument('--dest_file', type=str, help='Destination file path')
    parser.add_argument('--mode', type=str, choices=['json', 'jsonl'], help='Output mode')

    args = parser.parse_args()

    parse_responses(args.source_file, args.dest_file, args.mode)
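For reference, assuming the script above is saved as parse_responses.py (the filename and example arguments here are hypothetical), a typical invocation would be: python parse_responses.py --source_file burp_graphql_export.xml --dest_file group_members.jsonl --mode jsonl, where the source file is the Burp Suite XML export for one group.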

This produced line-delimited JSON records in which the objects had a structure like the following (excerpted and redacted):

{
    "data": {
        "node": {
            "__typename": "Group",
            "group_member_discovery": {
                "edges": [],
                "page_info": {
                    "end_cursor": "bucket_item:1:12:[REDACTED]",
                    "has_next_page": true
                }
            },
            "id": "[REDACTED]"
        }
    },
    "extensions": {
        "is_final": true
    }
},
{
    "data": {
        "node": {
            "__typename": "Group",
            "id": "[REDACTED]",
            "new_forum_members": {
                "edges": [
                    {
                        "membership": {
                            "invite_status_text": null,
                            "join_status_text": {
                                "delight_ranges": [],
                                "image_ranges": [],
                                "inline_style_ranges": [],
                                "aggregated_ranges": [],
                                "ranges": [],
                                "color_ranges": [],
                                "text": "Joined about 2 weeks ago"
                            },
                            "id": "[REDACTED]"
                        },
                        "node": {
                            "__typename": "User",
                            "id": "[REDACTED]",
                            "__isProfile": "User",
                            "name": "[REDACTED]",
                            "is_verified": false,
                            "__isEntity": "User",
                            "work_foreign_entity_info": null,
                            "url": "https://www.facebook.com/example.user",
                            "__isGroupMember": "User",
                            "group_membership": {
                                "has_member_feed": true,
                                "associated_group": {
                                    "id": "[REDACTED]",
                                    "leaders_engagement_logging_settings": {
                                        "comet_surface_mappings": [
                                            {
                                                "__typename": "GroupLeadersEngagementLoggingExactCometSurfaceMapping",
                                                "surface": "MALL",
                                                "trace_policy": "comet.group"
                                            },
                                            {...}
                                        ]
                                    }
                                },
                                "id": "[REDACTED]",
                                "user_signals_info": {},
                                "member_actions": []
                            },
                            "profile_picture": {
                                "uri": "[REDACTED]",
                                "width": 60,
                                "height": 60,
                                "scale": 1
                            },
                            "bio_text": null,
                            "user_type_renderer": {}
                        },
                        "cursor": null
                    },

After converting the batches of data from XML into JSONL format, I also extracted it into a much-simplified data model in a standard JSON array, using a separate Python script. This step was necessary to make the data easier to work with in the subsequent stages of the project.

Here is the Python script used for this process:

import argparse
import json
from typing import List

from nullsafe import undefined, _


def parse_graphql_file(file: str) -> List[dict]:
    '''Parse one file of decoded GraphQL responses into a flat list of member records.'''
    arr = []
    with open(file, "r") as f:
        for line in f.readlines():
            # Responses for the group admin/moderator list
            if line.startswith('{"data":{"group":'):
                try:
                    a = json.loads(line)
                    for i in a['data']['group']['group_admin_profiles']['edges']:
                        arr.append({
                            "name": i['node']['name'],
                            "link": i['node']['url'],
                            "profile_pic": i['node']['profile_picture']['uri'],
                            "role": 'admin/moderator'
                        })
                except Exception as ex:
                    print(ex)
            # Responses for the regular member list
            elif line.startswith('{"data":{"node":'):
                try:
                    a = json.loads(line)
                    for i in a['data']['node']['new_forum_members']['edges']:
                        # bio_text may be null, so use null-safe access for the description.
                        description = _(i['node']['bio_text'])['text']
                        arr.append(
                            {
                                "name": i['node']['name'],
                                "link": i['node']['url'],
                                "profile_pic": i['node']['profile_picture']['uri'],
                                "joined": i['membership']['join_status_text']['text'],
                                "description": description if description is not undefined else None
                            }
                        )
                except Exception as ex:
                    print(ex)

    return arr


def parse_and_save_files(source_files: List[str], dest_files: List[str], group_names: List[str]):
    for source_file, dest_file, group_name in zip(source_files, dest_files, group_names):
        members_arr = {group_name: []}
        members_arr[group_name].extend(parse_graphql_file(source_file))

        with open(dest_file, 'w') as f:
            json.dump(members_arr, f, indent=4)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Extract simplified member records from parsed GraphQL JSONL files.')
    parser.add_argument('--source_files', nargs='+', help='Source file paths')
    parser.add_argument('--dest_files', nargs='+', help='Destination file paths')
    parser.add_argument('--group_names', nargs='+', help='Group names')

    args = parser.parse_args()

    parse_and_save_files(args.source_files, args.dest_files, args.group_names)
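Similarly, assuming the script is saved as extract_members.py (a hypothetical filename, and the .jsonl input names below are placeholders), all three groups can be processed in one run: python extract_members.py --source_files trump.jsonl rwb.jsonl woo.jsonl --dest_files TrumpPOTUS1.json RedWhiteandWETHEPEOPLE.json WoothePeople.json --group_names TrumpPOTUS Red_White_and_WE_THE_PEOPLE Woo_the_People, producing the per-group JSON files read by the network-mining notebook below.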

Data Processing: From Raw Data to Network Analysis

Once the data was collected and transformed into a usable format, the next step was to process the data and prepare it for network analysis. This involved reading the data for each group from its corresponding .json file, computing the intersection of group members shared between the three groups, and creating node and edge lists for all groups and their members.

Initially, I exported the data into a list for importing into the Memgraph graph database. However, I found that Memgraph could not handle this much data in a Memgraph Lab visualization in a performant manner. Therefore, I decided to merge the data together and export it into a single list of nodes and edges in the file network.json for importing into Neo4j.

Here is the Python code used for this process, which can be viewed in greater detail in the Jupyter notebook network-mining.ipynb:

import json

import shortuuid  # generates short unique IDs for graph nodes

trumpPOTUS_members = []
rwb_trump_members = []
woo_members = []
DATA_DIR = 'data/fb_group_member_lists/'

with open(DATA_DIR + 'TrumpPOTUS1.json', 'r') as f:
    member = json.loads(f.read())
    print(len(member['TrumpPOTUS']))
    # Deduplicate identical member records, then key the remainder by profile link.
    member_set = {frozenset(item.items()): item for item in member['TrumpPOTUS']}.values()
    print(len(member_set))
    trumpPOTUS_members = {v['link']: v for v in member_set}

with open(DATA_DIR + 'RedWhiteandWETHEPEOPLE.json', 'r') as f:
    member = json.loads(f.read())
    print(len(member['Red_White_and_WE_THE_PEOPLE']))
    member_set = {frozenset(item.items()): item for item in member['Red_White_and_WE_THE_PEOPLE']}.values()
    print(len(member_set))
    rwb_trump_members = {v['link']: v for v in member_set}

with open(DATA_DIR + 'WoothePeople.json', 'r') as f:
    member = json.loads(f.read())
    print(len(member['Woo_the_People']))
    member_set = {frozenset(item.items()): item for item in member['Woo_the_People']}.values()
    print(len(member_set))
    woo_members = {v['link']: v for v in member_set}

# Calculate the mutual group members between the three groups
mutual_friends = {'TrumpPOTUS-RWBTrump': [], 'TrumpPOTUS-Woo': [], 'RWBTrump-Woo': [], 'all': []}
mutual_friends['TrumpPOTUS-RWBTrump'] = set(trumpPOTUS_members.keys()) & set(rwb_trump_members.keys())
mutual_friends['TrumpPOTUS-Woo'] = set(trumpPOTUS_members.keys()) & set(woo_members.keys())
mutual_friends['RWBTrump-Woo'] = set(rwb_trump_members.keys()) & set(woo_members.keys())
members_set_list = [set(trumpPOTUS_members.keys()), set(rwb_trump_members.keys()), set(woo_members.keys())]
# unpack the list of sets and pass all of them as arguments into intersection()
mutual_friends['all'] = members_set_list[0].intersection(*members_set_list)

print({type(intersect) for intersect in mutual_friends})
print(len(mutual_friends['TrumpPOTUS-RWBTrump']))
print(len(mutual_friends['TrumpPOTUS-Woo']))
print(len(mutual_friends['RWBTrump-Woo']))
print(len(mutual_friends['all']))
print(len(trumpPOTUS_members))

nodes_edges_output_list = {"nodes": [], "edges": []}

# Merge the three member dictionaries (deduplicating members shared across groups)
# and give every member a unique node ID.
full_members_dict = {**trumpPOTUS_members, **rwb_trump_members, **woo_members}
for value in full_members_dict.values():
    value["id"] = shortuuid.uuid()
    nodes_edges_output_list["nodes"].append(value)

# Group metadata of the form {'group_link': {'name': ..., 'id': ..., 'members': group_members_dict, ...}, ...}
groups_dict = {
    "https://www.facebook.com/groups/[REDACTED]/": {
        "name": "PRESIDENT DONALD TRUMP",
        "id": shortuuid.uuid(),
        "members": trumpPOTUS_members,
        "link": "https://www.facebook.com/groups/[REDACTED]/",
        "labels": ["Group"]},
    "https://www.facebook.com/groups/[REDACTED]/": {
        "name": "RED, WHITE, and WE THE PEOPLE back the BLUE & PRESIDENT TRUMP",
        "id": shortuuid.uuid(),
        "members": rwb_trump_members,
        "link": "https://www.facebook.com/groups/[REDACTED]/",
        "labels": ["Group"]
    },
    "https://www.facebook.com/groups/[REDACTED]/": {
        "name": "Woo the People",
        "id": shortuuid.uuid(),
        "members": woo_members,
        "link": "https://www.facebook.com/groups/[REDACTED]/",
        "labels": ["Group"]}
}

for grp in groups_dict.values():
    source_id = grp["id"]
    nodes_edges_output_list['nodes'].append(grp)
    for k, v in grp["members"].items():
        # Rather than using the member record given here directly, look it up by key in full_members_dict
        # (which contains no duplicate member entries) to get the member's assigned node ID.
        member_node_id = full_members_dict[k]["id"]
        nodes_edges_output_list['edges'].append({"source": source_id, "target": member_node_id, "join_date": v.get('joined')})

with open('data-temp/network.json', 'w') as f:
    json.dump(nodes_edges_output_list, f, indent=4)
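For completeness, here is a minimal, hedged sketch (not taken from the project repository) of how network.json could then be loaded into Neo4j with the official neo4j Python driver; the connection URI, credentials, node label, and relationship type are assumptions for illustration, not the project's actual import code:

import json

from neo4j import GraphDatabase  # official Neo4j Python driver

# Assumed connection details for a local Neo4j instance; adjust to your environment.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with open("data-temp/network.json", "r") as f:
    network = json.load(f)

with driver.session() as session:
    # For simplicity every node gets a generic :Profile label here; a fuller import could branch on n.labels.
    session.run(
        "UNWIND $nodes AS n "
        "MERGE (p:Profile {id: n.id}) "
        "SET p.name = n.name, p.link = n.link",
        nodes=[{"id": n["id"], "name": n.get("name"), "link": n.get("link")} for n in network["nodes"]],
    )
    # Create group-to-member relationships from the edge list.
    session.run(
        "UNWIND $edges AS e "
        "MATCH (g:Profile {id: e.source}), (m:Profile {id: e.target}) "
        "MERGE (g)-[r:HAS_MEMBER]->(m) "
        "SET r.join_date = e.join_date",
        edges=network["edges"],
    )

driver.close()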

Data Processing and Enrichment: The Challenges of Multilingual Data

In the previous sections, I discussed how I collected data from various sources and prepared it for analysis. Now, let’s dive into the next stage of the project: data processing and enrichment. This stage was particularly challenging due to the multilingual nature of our data and the limitations of the libraries we used.

Facebook Profile Processing

I started by reading a JSON file containing the Facebook profile data. My goal was to perform language detection and named entity recognition on the profile descriptions. To do this, I initially used the spacy library to build the NLP pipeline. However, I soon discovered that spacy could not detect and classify some Southeast Asian languages prevalent in the dataset, such as Vietnamese.

This led me to switch to the stanza library, which supports more languages. However, even stanza did not support all the languages present in the data, such as Thai.

I attempted to train a new custom model to recognize these unsupported languages, but this proved to be a time-consuming and ultimately unsuccessful effort. Training the new model on a CPU-only laptop took over 10 hours, which was a productivity-killing non-starter. I finally succeeded in training it using a Google Cloud Platform virtual machine with 4 GPUs, but then I could not get stanza to consume the new model in place of the default model.

In the end, I opted to manually correct the language codes for the unsupported languages in the dataset. This was a pragmatic solution that allowed me to move forward with the analysis.

I also extracted location and occupation information from the profile descriptions using regular expressions and named entity recognition. The processed data was then saved to a new CSV file. See below:

import json

import pandas as pd
import stanza
from stanza.pipeline.multilingual import MultilingualPipeline

stanza.download("multilingual")

# Multilingual pipeline: each detected language gets its own processor chain.
nlp_multi = MultilingualPipeline(
    lang_id_config={
        "langid_clean_text": False,
        "langid_lang_subset": ["en", "ar", "es", "pt", "be", "bg", "ko", "id", "he", "ru", "th", "tr", "vi"],
    },
    lang_configs={
        "en": {"processors": 'tokenize, pos, ner', "download_method": stanza.DownloadMethod.REUSE_RESOURCES},
        "ar": {"processors": 'tokenize, ner', "download_method": stanza.DownloadMethod.REUSE_RESOURCES},
        "es": {"processors": 'tokenize, pos, ner', "download_method": stanza.DownloadMethod.REUSE_RESOURCES},
        "pt": {"processors": 'tokenize, pos', "download_method": stanza.DownloadMethod.REUSE_RESOURCES},
        "be": {"processors": 'tokenize', "download_method": stanza.DownloadMethod.REUSE_RESOURCES},
        "bg": {"processors": 'tokenize, ner', "download_method": stanza.DownloadMethod.REUSE_RESOURCES},
        "he": {"processors": 'tokenize', "download_method": stanza.DownloadMethod.REUSE_RESOURCES},
        "id": {"processors": 'tokenize', "download_method": stanza.DownloadMethod.REUSE_RESOURCES},
        "ko": {"processors": 'tokenize', "download_method": stanza.DownloadMethod.REUSE_RESOURCES},
        "th": {"processors": 'tokenize', "download_method": stanza.DownloadMethod.REUSE_RESOURCES},
        "tr": {"processors": 'tokenize, ner', "download_method": stanza.DownloadMethod.REUSE_RESOURCES},
        "vi": {"processors": 'tokenize, ner', "download_method": stanza.DownloadMethod.REUSE_RESOURCES}
    },
    ld_batch_size=85,
    max_cache_size=15)

lang_detector = stanza.Pipeline(lang="multilingual", processors="langid",
                                langid_lang_subset=["en", "ar", "es", "pt", "be", "bg", "ko", "id", "he", "ru", "tr", "vi"])

with open("data-temp/network.json", "r", encoding="utf-8") as f:
    content = json.loads(f.read())
    fb_df = pd.DataFrame(content.get("nodes"))

fb_df.drop(columns=["members"], inplace=True)

# Important note: I'm choosing the third-party regex library instead of the standard re library because some of the
# description values that follow the "Works at" pattern have 2 or more whitespace characters between the 'at' and the
# next word. Dealing with this requires making the positive lookaround variable-length, which is not supported in the
# default Python regex engine. Hence the different engine library choice.
import regex
from numpy import nan

regex_pattern = r"\b(?<=\bat\s{1,2}?\X{0,1})\p{L}.*$"

def apply_and_concat(dataframe, field, func, column_names):
    return pd.concat((
        dataframe,
        dataframe[field].apply(
            lambda cell: pd.Series(func(cell), index=column_names))), axis=1)

def apply_and_concat_external(target_dataframe, source_series, func, column_names):
    return pd.concat((
        target_dataframe,
        source_series.apply(lambda cell: pd.Series(func(cell), index=column_names))
    ), axis=1)

def get_lang_and_entities(doc: stanza.models.common.doc.Document):
    '''Return the detected language plus location, occupation, miscellaneous, other, and untagged entity strings.'''
    location_results = []
    occupation_results = []
    misc_results = []
    other_results = []
    untagged_results = []
    if len(doc.ents) > 0:
        for ent in doc.ents:
            if ent.type in ['LOC', 'ORG', 'GPE', 'LC', 'OG', 'LOCATION', 'ORGANIZATION']:
                location_results.append(ent.text)
            elif ent.type in ['FAC', 'PRODUCT']:
                occupation_results.append(ent.text)
            elif ent.type in ['PERSON', 'PER', 'PS', 'MISC', 'MISCELLANEOUS']:
                misc_results.append(ent.text)
            else:
                other_results.append(ent.text)
    else:
        # No named entities found: fall back to the "Works at ..." regex rule.
        works_at_rule_results = regex.search(regex_pattern, doc.text, flags=regex.UNICODE)
        if works_at_rule_results is not None:
            start, end = works_at_rule_results.span()
            span = doc.text[start:end]
            untagged_results.append(span)
    results_tuple = (
        (doc.lang),
        (', '.join(location_results) if len(location_results) != 0 else nan),
        (', '.join(occupation_results) if len(occupation_results) != 0 else nan),
        (', '.join(misc_results) if len(misc_results) != 0 else nan),
        (', '.join(other_results) if len(other_results) != 0 else nan),
        (', '.join(untagged_results) if len(untagged_results) != 0 else nan)
    )
    return results_tuple


docs_series = fb_df['description'][fb_df['description'].notna()]
docs_list = docs_series.to_list()
langed_docs = nlp_multi(docs_list)
for doc in langed_docs:
    print("---")
    print(f"text: {doc.text}")
    print(f"lang: {doc.lang}")
    print(f"ents: {doc.ents}")
    print(f"{doc.sentences[0].dependencies_string()}")
langed_series = pd.Series(langed_docs, index=docs_series.index)
docs_series = apply_and_concat_external(docs_series, langed_series, get_lang_and_entities,
                                        ['lang', 'location', 'occupation', 'misc_ents', 'other_ents', 'untagged'])
fb_df = fb_df.join(docs_series.loc[:, ['lang', 'location', 'occupation', 'misc_ents', 'other_ents', 'untagged']], how="outer")
fb_df.to_csv("data-temp/fb_data_langid_semiprocessed.csv", sep='\t', encoding='utf-8-sig')

My struggles with the limitations of these NLP libraries and the multilingual nature of the data highlight the challenges of working with real-world, multilingual data. However, they also taught valuable lessons about the complexities of data processing and enrichment in a multilingual context.

The Jupyter notebook data-processing-enrichment.ipynb, containing the full code for this multilingual NLP data processing pipeline, can be read in full in the GitHub repository.

Overcoming Challenges in Multilingual NLP and Geocoding

In the previous section, I discussed the challenges of processing and enriching the multilingual data. In this section, I will delve deeper into the issues I encountered and the solutions I implemented.

Language Identification and Named Entity Recognition

As mentioned previously, my first challenge was integrating the tweaked bi-LSTM language identification model, which was re-trained to recognize Thai, back into the Stanza multilingual pipeline. Despite my best efforts, I was unsuccessful in making this integration work. As a workaround, I decided to manually correct the mislabeled Thai, Lao, and other rows in our dataset.

However, as I went through the rows during exploratory data analysis, I noticed other inaccuracies in the language classifications and Named Entity Recognition (NER) results from the model. For instance, I found a row labeled as Spanish (ES) that was actually written in Italian. Another row was labeled as Portuguese (PT) when the description was actually a location in Romania. Many US-based locations were misclassified as non-English, especially Indonesian.

I also observed that English text styled with unusual ASCII characters was often misclassified as Belarusian. This was a common trend in the dataset and required further analysis.

During the correction process, I discovered additional languages in the dataset, including Sinhalese, Burmese, Khmer, Gujarati, and Georgian. This highlighted the complexity and diversity of our multilingual data.
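As an illustration of what that manual correction looked like in practice, here is a minimal sketch; the row indices and most of the misclassified-to-corrected pairs are hypothetical (the Italian-labeled-as-Spanish case is from the review above), and the real corrections were made row by row during exploratory data analysis:

import pandas as pd

fb_df = pd.read_csv("data-temp/fb_data_langid_semiprocessed.csv", sep="\t", index_col=0)

# Hypothetical corrections: map specific row indices to the ISO 639-1 code that manual review determined.
manual_lang_fixes = {
    1042: "th",   # Thai text the restricted language subset could not label correctly
    2210: "lo",   # Lao
    3307: "it",   # Italian description that had been labeled as Spanish
}
for row_idx, lang_code in manual_lang_fixes.items():
    if row_idx in fb_df.index:
        fb_df.loc[row_idx, "lang"] = lang_code

fb_df.to_csv("data-temp/fb_data_langid_semiprocessed_v2.csv", sep="\t", encoding="utf-8-sig")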

In total, I found the following languages represented in our dataset from the 3 domestic American political Facebook groups:

  • en (English)
  • es (Spanish)
  • pt (Portuguese)
  • id (Indonesian)
  • vi (Vietnamese)
  • be (Belarusian)
  • bg (Bulgarian)
  • ro (Romanian)
  • ru (Russian)
  • ar (Arabic)
  • tr (Turkish)
  • th (Thai)
  • ko (Korean)
  • lo (Lao)
  • he (Hebrew)
  • zh (Chinese)
  • uk (Ukrainian)
  • ka (Georgian)
  • gu (Gujarati)
  • my (Burmese)
  • km (Khmer)
  • si (Sinhalese)

At this point, my NER extraction step was only capturing named entities from about 58% of the rows that had description values. This was a significant issue that needed to be addressed.

To improve the NER extraction, I expanded the NER tagging function to extract more uncategorized named entities, including spans not recognized as named entities by the pipeline processors. I added columns to the dataframe for “other_ents” and “untagged” entities, which increased the number of rows with extracted entities and improved the coverage of the NER extraction.

Dealing with Cultural Context

While processing non-Western language descriptions, I encountered cultural nuances that required additional context to understand. For example, I found 4 different Thai profiles with the description “บริษัท พ่อกับแม่ จำกัดมหาชน”, which translates to “Father and Mother Public Company Limited.”

Upon further investigation, I found a couple of posts on a Thai-speaking forum, asking about such profiles. The users there seemed to perceive it as a Thai Facebook trend and wondered if it was a joke (source 1, source 2).

This highlighted the importance of understanding cultural context when processing multilingual data. Without this context, we might misinterpret the data or miss important nuances.

Finalizing the Geocoding Process and Visualizing the Data

After months of work, I finally completed the objective of geocoding the accounts with Description fields and mapping them onto a latitude-longitude geospatial distribution visualization in ArcGIS. However, this process was not without its challenges and required several iterations to refine the accuracy of the geocoded data.

Implementing the Geocoding Function

I started by creating a C# Azure Function skill for extracting latitude-longitude coordinates from a text string with a location name. This function wrapped the Bing Maps API and was deployed into a personal Azure cloud account, building on the code for the GetGeoPointFromName Azure Function skill from the azure-search-power-skills GitHub repository.

Next, I created a Python script that called this Azure function. The script was designed to extract geocoordinates from the dataframe and enhance the dataframe with the resulting geospatial information. It did this by making a call to the external API provider for each record or group of records, adding the response to a list of processed responses, and then joining this list back onto the dataframe.

import asyncio
import json
from typing import Any, Dict, List

import httpx
import numpy as np
import pandas as pd
from nullsafe import undefined, _


'''
Task: Extract geocoordinates from the dataframe and enhance the dataframe with the resulting geospatial information.

Step 1: For each record or group of records, make a call out to an external API provider. The API provider response
should include a latitude and longitude coordinate for the supplied location description. Add it to a list of
processed responses.
Step 2: When the list of response coordinates is complete, join it back onto the dataframe.
Step 3: With the expanded dataframe of coordinates, save it to another intermediary CSV file for creating a new
spatially-enabled dataframe that includes the Shape data object for each lat-long coordinate pair.
'''


url = "https://knowledgeminingcustomskills-dev.azurewebsites.net/api/geo-point-from-name?code=<authorization code>"


async def get_geocoordinates(name: str):
    '''Get the geocoordinates for a single location name from the Azure Function API.'''
    if name and not name.isspace():
        input = {"values": [{"recordId": "something", "data": {"address": name}}]}
        response = None
        try:
            resp = httpx.post(url, json=input)
            resp.raise_for_status()
            response = resp.json()
        except httpx.HTTPError as http_err:
            print(f"Error has occurred: {http_err}")
        except json.decoder.JSONDecodeError as json_err:
            print(f"Error has occurred: {json_err}")
        finally:
            return response
    return


async def get_geocooordinates_batched_values(request: Dict):
    '''Send one batched request to the Azure Function and return the parsed JSON response.'''
    try:
        resp = httpx.post(url, json=request, timeout=httpx.Timeout(15.0))
        resp.raise_for_status()
        return resp.json()
    except httpx.HTTPError as http_err:
        print(f"Error has occurred: {http_err}")
    except json.decoder.JSONDecodeError as json_err:
        print(f"Error has occurred: {json_err}")


async def get_geocoordinates_batched(names: pd.Series):
    '''Get the geocoordinates for a Series of location names from the Azure Function API.'''
    input_values = []
    for index, name in names.items():
        if name and name is not np.nan and not name.isspace():
            input_values.append({"recordId": index, "data": {"address": name}})
    input = {"values": input_values}
    response = await get_geocooordinates_batched_values(input)
    if response is not None:
        response_values = response["values"]
        parsed_records = pd.DataFrame.from_records(response_values, index="recordId")
        parsed_records.index = parsed_records.index.astype(np.int64)
        coord_df = pd.json_normalize(parsed_records["data"])
        # Each geo point is a coordinate pair; wrap any bare values so the latitude/longitude columns line up.
        coords = [x if isinstance(x, list) else [x] for x in coord_df["mainGeoPoint.coordinates"]]
        split_df = pd.DataFrame(coords, index=parsed_records.index, columns=['latitude', 'longitude'])
        parsed_records = pd.concat([parsed_records, split_df], axis=1)
        return parsed_records


async def find_lat_longs(locations: pd.Series):
    '''Geocode the location Series in batches and collect the coordinates into one dataframe.'''
    temp_list = pd.DataFrame(columns=["latitude", "longitude"])
    batchsize = 50
    for i in range(0, len(locations), batchsize):
        batch = locations.iloc[i:i+batchsize]
        resp = await get_geocoordinates_batched(batch)
        if resp is not None:
            temp_list = temp_list.combine_first(resp[['latitude', 'longitude']])

    return temp_list


async def main():
    df = pd.read_csv("data-temp/fb_data_langid_semiprocessed_v2.csv", sep="\t")
    long_lat_list = await find_lat_longs(df["location"])

    output = df.join(long_lat_list)
    output.to_csv("data-temp/fb_data_geocoordinates.csv", sep="\t", encoding="utf-8-sig")


if __name__ == "__main__":
    asyncio.run(main())

Refining the Geocoded Data

After obtaining the geocoded data, I further refined its accuracy by re-geocoding the rows whose latitude-longitude values fell in ranges I had manually identified as low-accuracy during exploratory data analysis. I used the Google Geocoding API for this pass, and then fed the refined data into an ArcGIS geospatial map for visualization.
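As a rough illustration of that re-geocoding step, here is a minimal sketch of querying the Google Geocoding API for a set of low-accuracy rows; the API key, the low-accuracy row selection, and the output filename are assumptions for illustration, not the project's actual code:

import httpx
import pandas as pd

GOOGLE_GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"
API_KEY = "<google maps api key>"  # assumption: supplied via an environment variable or secret store in practice

df = pd.read_csv("data-temp/fb_data_geocoordinates.csv", sep="\t")

# Hypothetical selection: rows whose coordinates fell inside a range flagged as low-accuracy during EDA.
low_accuracy = df[df["latitude"].between(-5, 5) & df["longitude"].between(-5, 5) & df["location"].notna()]

for idx, row in low_accuracy.iterrows():
    resp = httpx.get(GOOGLE_GEOCODE_URL, params={"address": row["location"], "key": API_KEY})
    resp.raise_for_status()
    results = resp.json().get("results", [])
    if results:
        loc = results[0]["geometry"]["location"]  # {'lat': ..., 'lng': ...}
        df.loc[idx, ["latitude", "longitude"]] = (loc["lat"], loc["lng"])

df.to_csv("data-temp/fb_data_geocoordinates_refined.csv", sep="\t", encoding="utf-8-sig")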

However, I found that at least 25% of the latitude-longitude coordinates remained wildly inaccurate, so I opted to clean some of the remaining inaccurate data using the open-source tool OpenRefine.

Cleaning the Data

I made minor edits to the language, location, and occupation columns of various rows, replacing incorrect language codes with their correct ISO codes, moving location values that were actually professional occupations into the occupation column, and expanding some short location values with more complete addresses from Google & Bing.

For rows where the location was the name of a company with many offices across its country of operations, such as USAA, Pepsi, and The Bank of Brazil, I removed the latitude-longitude coordinates due to the high ambiguity. I did the same for location values in non-English languages that were actually joke text. In some cases of location ambiguity, I consulted the corresponding Facebook profile to check which region they were based in. In other cases, where there were several possibilities but not many, I picked one, which worked for geolocating to the country level but would be inaccurate for more granular analysis.

Visualizing The Geographic Distribution of Accounts

Finally, I utilized the ArcGIS Python API to create point-based maps of the geographic distribution of these accounts, as well as a point-density heatmap. The results are shown below.
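The mapping code itself is not reproduced here, but a minimal sketch of the approach using the ArcGIS API for Python might look like the following; the anonymous GIS connection and the heatmap renderer choice are assumptions, and the real notebook may differ:

import pandas as pd
from arcgis.gis import GIS
from arcgis.features import GeoAccessor  # registers the .spatial accessor on DataFrames

gis = GIS()  # anonymous connection is enough for basemaps and simple plotting

df = pd.read_csv("data-temp/fb_data_geocoordinates.csv", sep="\t")
df = df.dropna(subset=["latitude", "longitude"])

# Build a spatially-enabled dataframe from the latitude/longitude columns.
sdf = GeoAccessor.from_xy(df, x_column="longitude", y_column="latitude")

# Point map of the claimed account locations.
point_map = gis.map()
sdf.spatial.plot(map_widget=point_map)

# Density view of the same points; "h" selects the heatmap renderer (assumption based on the API's renderer types).
heat_map = gis.map()
sdf.spatial.plot(map_widget=heat_map, renderer_type="h")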

The Global Reach of Disinformation

In the case of the TrumpPOTUS group, a significant number of suspected spammer accounts were identified as originating from countries like the Philippines, Indonesia, and India, and from regions such as Hong Kong and Taiwan. My own analysis of the dataset largely corroborates these findings, with a high prevalence of accounts claiming to be located in Indonesia, the Philippines, and India.

Interestingly, I also found a high prevalence of accounts coming from Vietnam, the Democratic Republic of Congo (DRC), South Africa (chiefly in Johannesburg), Seoul, and various cities and countries in South America, especially Brazil, Colombia, Bolivia, and Argentina.

Other, smaller hotspots were in Cambodia, Thailand, Malaysia, Madagascar, and Australia, as well as scattered throughout Europe.

The Curious Case of “The Krusty Krab” and “Facebook App”

While analyzing the dataset, I stumbled upon some peculiar data points. The two most striking were the high frequencies of the locations “The Krusty Krab” and “Facebook App” in the Description field. There were 66 accounts claiming to be “at Facebook App” in the dataset, and 60 accounts claiming to be “at The Krusty Krab” (or some small variation thereof). These frequencies are about as high as the combined count of all the accounts from Americans saying they were based in the New York City area.
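For context, tallying these values is straightforward once the extracted location strings are in a dataframe; a sketch of the kind of frequency check used here follows (the exact normalization is illustrative):

import pandas as pd

df = pd.read_csv("data-temp/fb_data_geocoordinates.csv", sep="\t")

# Frequency of claimed locations across all three groups' member profiles.
location_counts = df["location"].str.strip().str.lower().value_counts()
print(location_counts.head(25))

# Spot-check the oddities called out above.
print(location_counts.filter(like="krusty krab"))
print(location_counts.filter(like="facebook app"))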

This oddity in the dataset raises several questions. Why would so many accounts claim to be located at “The Krusty Krab,” a fictional restaurant from the animated television series SpongeBob SquarePants? And what does it mean to be “at Facebook App”? These questions warrant further investigation and could potentially reveal new insights into the tactics and motivations of these spammers.

Future Work

In this article, we built an end-to-end data processing pipeline for multilingual language identification, geocoding, and geospatial distribution visualization. In future articles, I will explore additional investigative questions about this dataset and the full SCOTCH component chain, including but not limited to:

  • Why are so many accounts in these three groups claiming to work at Facebook App and The Krusty Krab? We can agree that this data has a suspicious smell — could these accounts be part of a coordinated inauthentic network used by spammers or other disinformation actors, or is it merely a (very weird but innocuous) coincidence?
  • Once shared into these Facebook group audiences, how much traction and broader engagement do articles from thedcpatriot[dot]com and Lancaster generate? Are they having a measurable impact on their target audience?
  • Can we assess with sufficiently high confidence whether this suspected coordinated inauthentic sharing of Couch content is being done with the objective of serving his website’s financial interests, or if the objective of the operation runs deeper than that?

The Impact of Disinformation

The impact of this disinformation campaign is not to be underestimated. It sows discord and mistrust among group members, leading to divisions and infighting. It also undermines the credibility of the group and its messages, as members struggle to distinguish between genuine posts and spam.

Moreover, the presence of these foreign spammers raises serious concerns about the integrity of our political discourse and the potential for foreign interference in our democratic processes. As we move forward, it is crucial that we develop effective strategies to combat this disinformation and safeguard our digital spaces.
