In a nutshell: Machine learning for text mining

Isabell Claus
Published in thinkers.ai
Jan 31, 2019

Machine learning is key to creating the rule sets needed to analyze the content of large volumes of text. There are several approaches to this task. Here is an overview of the state of the art.

To interpret language, semantic analysis aims to "translate" the rule sets of a language (grammar, spelling, common phrases, and linguistic peculiarities such as sarcasm) into program code, and trains machines to interpret language correctly based on those rule sets.

A second state-of-the-art approach to language interpretation derives the necessary context information from large amounts of data that represent a good cross-section of a language's characteristics.

Although semantic analysis delivers good results today, creating a complete set of rules for a language is difficult: a language evolves continuously over centuries, does not strictly follow rules, and contains a great many exceptions.

Extensive experience and a great deal of trial-and-error and empirical work produced a solution to this problem. It turned out that "direct learning", based on an adequate amount of data and well-selected, well-tested models, works very well without extensively hand-coding rules for grammar and the like.
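To make the idea of learning from data rather than from hand-written rules more concrete, here is a minimal sketch in Python. It is not the method used by any particular product: it uses simple co-occurrence counts as a stand-in for a trained model, and the tiny corpus, the function names and the window size are all assumptions made for illustration. The point is only that words used in similar contexts end up with similar representations, without a single grammar rule being coded.

```python
from collections import Counter, defaultdict
import math

# Tiny made-up corpus (an assumption for illustration);
# a real system would learn from millions of sentences.
corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the mouse",
    "the dog chases the cat",
    "the car needs fuel",
    "the truck needs fuel",
]

def cooccurrence_vectors(sentences, window=2):
    """For every word, count the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[key] for key, count in a.items() if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

vectors = cooccurrence_vectors(corpus)

# "car" and "truck" share their contexts; "car" and "milk" do not.
print("car ~ truck:", cosine(vectors["car"], vectors["truck"]))
print("car ~ milk: ", cosine(vectors["car"], vectors["milk"]))
```

Count-based vectors like these are sparse and grow with the vocabulary; trained models compress the same kind of context information into far fewer dimensions.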

The result of such a technology is known as "word vectors", or more precisely "high-dimensional, dense vectors", which capture the characteristics of words very well (polysemy, degree of similarity to other words, relationships with other words, etc.). Such word vectors are combined into sentences and sentence vectors. In a next step, the content covered by the sentences (text passages, articles and so on) is isolated. This is how similar content in different texts is detected. Moreover, it lets us ask machines questions like "Can you find text sections with content similar to the segment I am showing you here?" This brings us close to the way humans habitually communicate.
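The steps described above, from word vectors to sentence vectors to finding similar content, can be sketched roughly as follows. This is a simplified illustration under stated assumptions: the three-dimensional vectors are hand-picked rather than learned (real word vectors typically have hundreds of dimensions), averaging is just one simple way to combine word vectors into a sentence vector, and the names `WORD_VECTORS`, `sentence_vector` and `most_similar` are hypothetical.

```python
import math

# Hypothetical 3-dimensional word vectors, hand-picked for illustration only.
WORD_VECTORS = {
    "bank":     [0.90, 0.10, 0.00],
    "interest": [0.70, 0.30, 0.00],
    "rates":    [0.80, 0.10, 0.10],
    "credit":   [0.85, 0.15, 0.05],
    "costs":    [0.75, 0.20, 0.10],
    "river":    [0.00, 0.90, 0.20],
    "water":    [0.10, 0.80, 0.30],
    "flows":    [0.00, 0.70, 0.40],
}
DIMENSIONS = 3

def sentence_vector(sentence):
    """Average the vectors of all known words; unknown words are skipped."""
    known = [WORD_VECTORS[w] for w in sentence.lower().split() if w in WORD_VECTORS]
    if not known:
        return [0.0] * DIMENSIONS
    return [sum(vec[d] for vec in known) / len(known) for d in range(DIMENSIONS)]

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def most_similar(query, candidates):
    """Return the candidate text whose sentence vector is closest to the query."""
    query_vec = sentence_vector(query)
    return max(candidates, key=lambda text: cosine(query_vec, sentence_vector(text)))

sections = [
    "the bank raised interest rates",
    "water flows down the river",
]

# "Can you find text sections with content similar to this segment?"
print(most_similar("credit costs", sections))
```

Note that the query shares no words with the section it matches; the match comes from the vectors alone, which is what distinguishes this approach from keyword search.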
