Topic-modeling in the Social Semantic Web
I read a very interesting article on nautil.us this week about digital humanities and the problem of topic-modeling. Although the article (here) focuses on literature rather than the social web, I wanted to share some ideas it inspired in me about this field.
What is topic-modeling?
To spare you the long read of the article I am referring to (which you should still read), I will first explain what topic-modeling is.
In natural languages, one word can have many different meanings depending on the context of use. This is what we call polysemy: for example, the word “wood” can refer to the material that comes from trees or to a small forest. This is a major problem in computer-aided semantic analysis: when searching for a specific word, we don’t want “parasite” meanings of this word mixing into the results of our query.
The idea is then to characterize every word by its context: the words used around it and in the text as a whole. If we can determine the general topic (or topics) of the text, we can infer with reasonable accuracy which sense of the word is used in it. This is what topic-modeling does.
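As a toy illustration of this idea (entirely my own construction, not from the article), we can disambiguate “wood” by counting how many context words overlap with hand-picked seed words for each sense:

```python
# Toy word-sense disambiguation by context overlap.
# The seed-word lists for each sense are illustrative assumptions.
SENSES = {
    "material": {"carpenter", "plank", "saw", "furniture", "timber"},
    "forest": {"trees", "deer", "path", "birch", "walk"},
}

def guess_sense(sentence):
    """Pick the sense whose seed words overlap most with the context."""
    context = set(sentence.lower().split())
    return max(SENSES, key=lambda s: len(context & SENSES[s]))

print(guess_sense("the carpenter cut the wood plank with a saw"))       # material
print(guess_sense("deer crossed our path in the wood near the trees"))  # forest
```

Real topic-modeling replaces these hand-picked seed lists with topics learned automatically from the texts themselves.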
Topic-modeling algorithms are generally machine-learning algorithms. The article discusses the most widely used technique, latent Dirichlet allocation (LDA), which represents each topic as a bag of words and models each text as a probabilistic mixture, with percentages, of the different topics that make it up. Every word can then be linked to the semantic field in which it is actually used in that context.
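The article itself doesn’t show code, but here is a minimal sketch of LDA using scikit-learn; the toy corpus and the choice of two topics are my assumptions:

```python
# Minimal LDA sketch with scikit-learn (toy corpus, two topics assumed).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the carpenter cut the wood plank with a saw in the workshop",
    "oak wood furniture and fresh timber arrived at the sawmill",
    "we walked through the wood among birch trees and spotted deer",
    "the small wood near the village is full of old oak trees",
]

# LDA works on bag-of-words counts, one row per document.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

# Each row is a document's percentage mix over the two topics,
# hinting at which sense of "wood" dominates in that document.
print(doc_topics.round(2))
```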
Application in the Social Semantic Web
Topic-modeling works well on literature: the more text (that is, context) you have, the more accurately you can characterize the topics of a document, and the more accurate the results of a specific query will be. When dealing with the social web, things aren’t that easy.
Of course, there is plenty of data, a huge amount of content available to the algorithms, but it usually comes as separate “pieces” of text that are not necessarily related to one another. When querying Twitter for a specific word, each tweet may use that word in a different sense, and this actually happens quite often. To overcome this, the machine running the query would need a general understanding of the meaning of every tweet (otherwise we have to make our own query more specific, which filters out a lot of valuable content).
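As a sketch of what this could look like (my own construction, with toy data), one could fit a topic model on a larger reference corpus and then bucket each tweet matching the query under its dominant topic, so the different senses don’t mix:

```python
# Hedged sketch: separate keyword-query results by inferred topic.
# The reference corpus and tweets are toy data, not real query output.
from collections import defaultdict
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# A larger reference corpus gives the model the context single tweets lack.
reference_corpus = [
    "timber prices rise as carpenters buy more wood and planks",
    "the sawmill ships oak wood furniture panels to the workshop",
    "hiking through the wood we spotted deer among the trees",
    "the old wood behind the village is full of birch trees",
]
# Tweets returned by a keyword search for "wood".
tweets = [
    "just bought wood planks for the new bookshelf",
    "lovely walk in the wood this morning, so many trees",
]

vectorizer = CountVectorizer(stop_words="english").fit(reference_corpus)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(vectorizer.transform(reference_corpus))

# Bucket each tweet under its highest-probability topic.
buckets = defaultdict(list)
for tweet, mix in zip(tweets, lda.transform(vectorizer.transform(tweets))):
    buckets[int(mix.argmax())].append(tweet)
print(dict(buckets))
```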
It seems difficult, but I don’t think it is impossible to at least approximate the result and get a general idea of how the queried word is actually used in a tweet. Of course, if the tweet contains hashtags or a link, that helps a lot. Otherwise, we could first consider analysing the user’s other tweets, especially those chronologically close, to find the related topic. People sometimes share a series of tweets on the same subject, and a user usually has identifiable interests (based on their whole timeline) that can help narrow down the context. We can also analyse the current global trends for hints about a popular topic the tweet may refer to. Finally, we could take into account the reactions the tweet received (mentions and manual retweets, but also the interests of the people who fav or RT it) to infer its broad meaning and the sense of the queried word.
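Here is a hedged sketch of how these signals could be combined into an expanded “context document” before running topic inference; the Tweet type and all of its fields are hypothetical, not an actual Twitter API:

```python
# Hedged sketch of context expansion for a short tweet. The Tweet type
# and its fields are hypothetical, not a real Twitter API.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Tweet:
    text: str
    hashtags: List[str] = field(default_factory=list)       # explicit topic hints
    author_recent: List[str] = field(default_factory=list)  # chronologically close tweets
    reactions: List[str] = field(default_factory=list)      # interests of users who fav/RT
    trends: List[str] = field(default_factory=list)         # current global trends

def expand_context(tweet: Tweet) -> str:
    """Concatenate the tweet with its auxiliary signals so a topic
    model sees more context than the tweet's few words."""
    matching_trends = [t for t in tweet.trends if t.lower() in tweet.text.lower()]
    parts = [tweet.text, *tweet.hashtags, *tweet.author_recent,
             *tweet.reactions, *matching_trends]
    return " ".join(parts)

tweet = Tweet(
    text="This wood is gorgeous in autumn",
    hashtags=["hiking", "forest"],
    author_recent=["Morning walk among the trees"],
    reactions=["nature photography"],
)
print(expand_context(tweet))
```

The expanded string can then be fed to a topic model like the LDA sketch above in place of the bare tweet text.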
Obviously, Twitter is one of the most complicated (and challenging) examples of topic-modeling in the Social Semantic Web because of its 140-character limit. Platforms like Facebook or Tumblr let users post longer content as well as larger sets of hashtags, which helps a lot in this kind of analysis. Still, MIT researchers seem ready to tackle the problem, with the help of Twitter itself.