Role of AI in Newsrooms

Automatic fake news detection is a challenging problem in deception detection, and it has tremendous real-world political and social impact. It is harder than detecting deceptive reviews, since political language in TV interviews and posts on Facebook and Twitter consists mostly of short statements. Machine learning models are being used in the journalism space to minimise disagreement among readers.
The Internet is a medium that can deliver messages virtually anywhere in the world. However, governments control this freedom of expression and, with it, the reach of content. Survey studies have examined how social-media-led campaigns result in government changes or large uprisings. This leads stakeholders to develop software and tools that can detect such activities in advance and monitor them in real time. On the other hand, social media can help organizations govern better. We have seen organizations taking note of viral content on the Internet. Systems can be designed to surface public sentiment and opinion about issues and government policies, and political parties can use such real-time opinion/sentiment-mining systems.
If we need to identify whether a news item is fake, biased or misleading, we can find correlations between the signals described above using regression analysis. Regression analysis helps us understand how the output variable changes when we vary some input variables while keeping the others fixed. In linear regression, we assume that the relationship between input and output is linear. This constrains our modelling, but it is fast and efficient. Sometimes linear regression is not sufficient to explain the relationship between input and output, so we use polynomial regression, which fits a polynomial to the data. This is more computationally expensive and can capture the relationship more accurately, though at the risk of overfitting. Depending on the problem at hand, we use different forms of regression to extract the relationship.
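The linear-versus-polynomial trade-off above can be sketched on a toy signal. The feature is an invented stand-in for one of the signals described earlier; the quadratic relationship is chosen only to make the comparison visible.

```python
# Sketch: linear vs polynomial regression on a toy signal.
# The feature and its quadratic relationship to the output are invented
# purely to illustrate the trade-off described in the text.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 1))                 # hypothetical input signal
y = 3 * X[:, 0] ** 2 + 0.05 * rng.normal(size=200)   # quadratic ground truth

linear = LinearRegression().fit(X, y)
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

print(f"linear R^2:     {linear.score(X, y):.3f}")
print(f"polynomial R^2: {poly.score(X, y):.3f}")
```

On this data the degree-2 model explains nearly all the variance, while the linear fit leaves a visible gap, mirroring the point that the extra computation buys a better fit.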
The ML engine for clickbait detection and content relevance is built on three components. The first leverages neural networks for sequential modelling of text: the article title is represented as a sequence of word vectors, and each word of the title is additionally converted into character-level embeddings. These features serve as input to a bidirectional LSTM, and an attached attention layer allows the network to weight each word in the title differently. The second component focuses on the similarity between the article title and its actual content. For this, we generate Doc2Vec embeddings for the pair, which act as input to a Siamese network projecting them into a highly structured space whose geometry reflects complex semantic relationships. The last component attempts to quantify the similarity of the attached image, if any, to the article title. Finally, the outputs of the three components are concatenated and fed to a fully connected layer to generate a score for the task.
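The title–content relevance idea in the second component can be sketched in miniature. Here plain TF-IDF cosine similarity stands in for the Doc2Vec + Siamese-network embedding of the full system, and the titles and bodies are invented examples; the point is only that a clickbait article scores far lower on title–body similarity.

```python
# Minimal sketch of the title-content relevance component.
# TF-IDF cosine similarity is a lightweight stand-in for the Doc2Vec +
# Siamese-network embedding described above; the texts are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

title = "Scientists discover new exoplanet orbiting nearby star"
matching_body = ("Astronomers announced the discovery of an exoplanet "
                 "orbiting a star just twelve light years away")
clickbait_body = ("You won't believe these ten kitchen hacks that will "
                  "change your life forever")

vec = TfidfVectorizer().fit([title, matching_body, clickbait_body])
t, m, c = vec.transform([title, matching_body, clickbait_body])

print("title vs matching body: ", cosine_similarity(t, m)[0, 0])
print("title vs clickbait body:", cosine_similarity(t, c)[0, 0])
```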
Hate speech detection is done using a Natural Language Processing (NLP) technique called TF-IDF vectorization. A logistic regression model is then trained on these vectors to classify hate speech from the training data. Sentiment analysis and language polarity detection are also done with the same TF-IDF vectorization. Feeding a logistic regression with these vectors and training it to predict sentiment is known to be one of the most effective methods for sentiment analysis, both for fine-grained labels (very negative / negative / neutral / positive / very positive) and for simpler negative/positive classification.
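The TF-IDF-plus-logistic-regression pipeline described above can be sketched directly. The tiny labelled corpus is invented for illustration; a real system would train on thousands of annotated examples.

```python
# Sketch: TF-IDF vectorization feeding a logistic regression classifier,
# as described in the text. The labelled examples are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "I love this, absolutely wonderful reporting",
    "great article, very informative and fair",
    "what a fantastic and balanced piece",
    "this is terrible, biased garbage",
    "awful article, complete nonsense",
    "horrible reporting, very misleading",
]
labels = ["positive", "positive", "positive",
          "negative", "negative", "negative"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(texts, labels)
print(clf.predict(["wonderful and informative piece"]))   # expect positive
print(clf.predict(["terrible misleading nonsense"]))      # expect negative
```

The same pipeline extends to fine-grained sentiment simply by using five labels instead of two.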
Among traditional machine learning models, logistic regression (LR) is the most accurate at classifying abusive language. Unsupervised techniques such as clustering, Latent Semantic Indexing (LSI) and matrix factorization are used to train models that retrieve relevant content. Automatic text categorization and semantic analysis help analyze trends in an article and trace how a story has developed over the last six months, by comparing it with stored articles or searching the web for related articles and arranging them in chronological order. Text summarization in NLP is usually treated as a supervised machine learning problem, where outcomes are predicted from labelled training data.
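Of the unsupervised techniques listed, LSI is the most compact to sketch: a truncated SVD over a TF-IDF matrix places documents in a low-dimensional topic space, where a query can be matched to the most relevant stored article. The mini-corpus below is invented for illustration.

```python
# Sketch: Latent Semantic Indexing (LSI) via truncated SVD over TF-IDF,
# used to retrieve the stored article most relevant to a query.
# The four-document corpus is invented.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "parliament votes on the new budget bill",
    "the budget debate continues in parliament",
    "local team wins the football championship",
    "football fans celebrate the championship win",
]

vec = TfidfVectorizer()
X = vec.fit_transform(corpus)
svd = TruncatedSVD(n_components=2, random_state=0)
X_lsi = svd.fit_transform(X)            # documents in the 2-D latent space

query = svd.transform(vec.transform(["parliament budget vote"]))
scores = cosine_similarity(query, X_lsi)[0]
print("most relevant article:", corpus[scores.argmax()])
```

With two clearly separated topics, the query lands in the politics region of the latent space and retrieves one of the two parliament articles.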
Twitter is just one social media platform; additional research on other platforms, including LinkedIn and Facebook, would be informative and allow for further analytical lenses such as inter-network analysis, social network analysis, peak detection and topic analysis. Such research could be supplemented with primary qualitative research into the depth of socio-political engagement, and with secondary research on material shared on Twitter (reports, infographics, images, videos, etc.) as well as the primary destination sites for these materials, e.g. corporate websites, blogs, traditional media and other social networks. While the study focused largely on one firm-level antecedent (size), one industry (accounting) and one issue (Brexit), there is a rich stream of research opportunities, including additional antecedents, types of socio-political engagement, how firms organise for such engagement, and its outcomes.
There are seven signals by which we can distinguish bots from humans:
- Abnormal Account Activity
- Ratio of Engagements
- Followers versus Engagements
- Follower Origin
- Follower Bias/Copy
- Percentage of Followers with Newly Created Accounts
- The Retweet Test
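Two of the signals above can be sketched as simple metrics: the ratio of engagements to followers, and the percentage of followers with newly created accounts. The account data, field names and thresholds below are all hypothetical; a real detector would calibrate thresholds against labelled bot accounts.

```python
# Hedged sketch of two bot-detection signals from the list above.
# All data and thresholds are hypothetical placeholders.
from datetime import date

def engagement_ratio(engagements: int, followers: int) -> float:
    """Engagements (likes, replies, retweets) per follower."""
    return engagements / max(followers, 1)

def new_follower_pct(follower_created: list, today: date,
                     max_age_days: int = 30) -> float:
    """Share of followers whose accounts are younger than max_age_days."""
    new = sum((today - d).days < max_age_days for d in follower_created)
    return new / max(len(follower_created), 1)

followers = [date(2024, 1, 2), date(2024, 5, 28), date(2024, 6, 1),
             date(2024, 6, 3), date(2022, 3, 15)]
ratio = engagement_ratio(engagements=4, followers=50_000)
pct_new = new_follower_pct(followers, today=date(2024, 6, 5))

# Hypothetical rule: tiny engagement plus mostly brand-new followers is suspicious.
suspicious = ratio < 0.001 and pct_new > 0.5
print(f"engagement ratio={ratio:.5f}, new-follower share={pct_new:.0%}, "
      f"suspicious={suspicious}")
```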
A wide variety of machine learning algorithms can be applied directly to malicious URL detection. After URLs are converted into feature vectors, most learning algorithms can train a predictive model in a fairly straightforward manner. They can be categorized into batch learning algorithms, online algorithms, representation learning, and others. Batch learning algorithms work under the assumption that the entire training set is available before training begins. Online learning algorithms treat the data as a stream of instances and learn a prediction model by sequentially making predictions and updates, which makes them far more scalable than batch algorithms. Because the training data can be tremendous in size (millions of instances and features), scalable algorithms are needed, which is why online learning methods have found considerable success in this domain.
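The streaming update described above can be sketched with scikit-learn: a hashing vectorizer over character n-grams turns each URL into a fixed-size feature vector, and `SGDClassifier.partial_fit` updates the model one mini-batch at a time, as a stream would. The URLs and labels are invented for illustration.

```python
# Sketch: online learning for malicious-URL detection. Character n-gram
# hashing converts URLs into feature vectors; partial_fit consumes the
# data as a stream. All URLs and labels here are invented examples.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vec = HashingVectorizer(analyzer="char", ngram_range=(3, 5), n_features=2**18)
clf = SGDClassifier(random_state=0)

stream = [  # (batch of URLs, labels) pairs arriving over time; 1 = malicious
    (["http://example.com/news", "http://paypa1-login.xyz/verify"], [0, 1]),
    (["https://university.edu/courses", "http://free-prizes-now.ru/claim"], [0, 1]),
    (["https://example.com/about", "http://paypa1-secure.xyz/update"], [0, 1]),
]
for urls, labels in stream:                  # one model update per batch
    clf.partial_fit(vec.transform(urls), labels, classes=[0, 1])

test = vec.transform(["http://paypa1-login.xyz/confirm"])
print("predicted label (1 = malicious):", clf.predict(test)[0])
```

Because the model only ever holds one batch in memory, the same loop scales to millions of URLs, which is the scalability argument made above.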
The world of social media analytics is vast, with many domains to explore. Implementing the researched metrics may give us much better insight into the real Twitter influencers involved. We may also be interested in judging the power of Twitter as a medium of influence by measuring the opinions of an influencer's followers before and after the influencer's tweets. In the area of sentiment analysis, we may attempt to obtain higher classification accuracy by including features beyond the word embeddings themselves, such as which users retweeted a tweet or where the tweet originated.
