Predict Propaganda Twitter Accounts Using Machine Learning on Google Cloud BigQuery

Langton Muchemwa

I am doing a series on predicting propaganda, human-spam, and spam social bots on Twitter.

I am using a logistic regression model (https://en.wikipedia.org/wiki/Logistic_regression) on the BigQuery ML platform. I have a set of data that has been pre-classified with a field bot = 0/1, which we will use as the label for the model.

I loaded this data into a BigQuery table as below:

For the purposes of this post I will not go through loading data into BigQuery; I want to focus on the ML side and on using the model.


Feature correlation: I am using several features in this model. After an extensive evaluation I will drop the features that contribute little to overall performance.

Feature list and computations:

friends_count,
followers_count,
listed_count,
statuses_count,
digit_count_in_name,
bot_in_name — regex search for "bot" in the name,
len_url,
tweet_length,
accountage — statuses_count / accountage,
activeness,
names_ratio,
followership — followers_count / friends_count,
friendship — (friends_count, followers_count)

The choice of features was guided by patterns observed in fake accounts: their associations with other accounts and their tweeting patterns, among others.
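Several of the features above are derived from raw profile fields. As a hedged sketch of how they might be computed in BigQuery Standard SQL (the source table `Social.raw_tweets` and the raw column names are assumptions, not the original code):

```sql
-- Sketch only: table and raw column names are assumptions.
SELECT
  friends_count,
  followers_count,
  listed_count,
  statuses_count,
  -- number of digits appearing in the screen name
  ARRAY_LENGTH(REGEXP_EXTRACT_ALL(name, r'[0-9]')) AS digit_count_in_name,
  -- 1 if "bot" appears anywhere in the name, else 0
  IF(REGEXP_CONTAINS(LOWER(name), r'bot'), 1, 0) AS bot_in_name,
  CHAR_LENGTH(url) AS len_url,
  CHAR_LENGTH(tweet_text) AS tweet_length,
  -- SAFE_DIVIDE returns NULL instead of erroring when friends_count = 0
  SAFE_DIVIDE(followers_count, friends_count) AS followership
FROM `Social.raw_tweets`;
```

SAFE_DIVIDE is worth the extra characters here: brand-new accounts often have zero friends, and a plain division would fail the whole query.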

Step 1: Create the Model
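The original embedded script is not reproduced here; as a minimal sketch of the shape a BigQuery ML CREATE MODEL statement takes (the training table `Social.tweets` and the exact feature set are assumptions):

```sql
-- Sketch only: training table and selected columns are assumptions.
CREATE OR REPLACE MODEL `Social.ispopaganda`
OPTIONS (
  model_type = 'logistic_regression',  -- binary classifier
  input_label_cols = ['bot']           -- the pre-classified 0/1 field
) AS
SELECT
  friends_count,
  followers_count,
  listed_count,
  statuses_count,
  digit_count_in_name,
  bot_in_name,
  len_url,
  tweet_length,
  followership,
  bot
FROM `Social.tweets`;
```

Every column in the SELECT other than the label is treated as a feature, so the SELECT list is effectively the feature list.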

In the CREATE MODEL script I create a model named ispopaganda in the Social dataset of my project space. Much of the string and date handling in the script exists because I did not transform the data before loading it. I intend to use Cloud Dataprep (https://cloud.google.com/dataprep/) when I productionize the model, to handle the data-preparation side of things dynamically and ensure good results; for now I will have to live with the code.

Step 2: Predict Using the Model
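Prediction is a call to the ML.PREDICT function. A minimal sketch, assuming the model name from Step 1 and a hypothetical input table of unlabelled accounts:

```sql
-- Sketch only: the input table `Social.new_accounts` is an assumption.
SELECT *
FROM ML.PREDICT(
  MODEL `Social.ispopaganda`,
  (SELECT * FROM `Social.new_accounts`)
);
```

For a logistic regression model the output includes a predicted label column and the per-class probabilities alongside the input columns.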

Result

Step 3: Evaluate the Model

One of the key steps is evaluating the performance of the model. Depending on the KPI you want (recall, for example), you may have to tune your threshold and features until the model gives you the desired result. That said, we have to evaluate the model continuously; this can be automated, and we will look at that in a later post. For now, let's go through how to get evaluation results using the ML.EVALUATE and ML.ROC_CURVE functions.
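A sketch of the ML.EVALUATE call, assuming the model from Step 1 and evaluation against the labelled training table (in practice you would hold out a separate evaluation split):

```sql
-- Sketch only: evaluation table and threshold value are assumptions.
SELECT *
FROM ML.EVALUATE(
  MODEL `Social.ispopaganda`,
  (SELECT * FROM `Social.tweets`),
  STRUCT(0.5 AS threshold)  -- classification cutoff to evaluate at
);
```

The optional threshold STRUCT is what lets you re-run the evaluation at different cutoffs while tuning for recall.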

For the ROC curve (https://en.wikipedia.org/wiki/Receiver_operating_characteristic) we will execute the query SELECT * FROM ML.ROC_CURVE(MODEL …, then click Explore with Data Studio to create a graph with Google's free reporting tool. We set dimension = false_positive_rate and metric = recall; the result is below.
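A sketch of that ROC query in full (model name assumed from Step 1):

```sql
-- Sketch only: model name assumed from Step 1.
SELECT *
FROM ML.ROC_CURVE(MODEL `Social.ispopaganda`);
```

Each output row corresponds to one threshold, with its recall and false_positive_rate — exactly the two columns charted in Data Studio.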

WHAT'S NEXT

I am going to do a Part 2, where we will tune the model and serve it to predict on real data that I am collecting from Twitter using Google Connected Sheets. I have glossed over a lot of detail; I will attempt a deeper dive in Part 3. If you have questions you can reach me at muchemwal@gmail.com

Credits: https://www.sciencedirect.com/science/article/pii/S0925231218308798

https://towardsdatascience.com/how-to-use-k-means-clustering-in-bigquery-ml-to-understand-and-describe-your-data-better-c972c6f5733b
