The Sparse Matrix with NLP

Shachia Kyaagba
Aug 9, 2018


Natural language processing (NLP) is a branch of computer science and artificial intelligence concerned with how computers ‘understand’ human natural language. Its ultimate objective is to close the gap between human communication and computer understanding.

In this post I will focus on natural language in the form of text. So much text has been generated over millennia that it just isn’t feasible for human beings to go through it all manually to extract insight. The natural thought is to turn to computers. If computers have consistently made other areas of life more efficient (think calculators, cars, trains, medicine, etc.), why can’t they also be applied to text processing?

The two major challenges were:

  1. Analyzing very large amounts of data in any form is computationally demanding.
  2. Computers only understand and communicate in numbers. To analyze text, the text first needs to be converted into a form the computer can understand (numbers!).

For the first challenge, the advent of parallel computing and the graphics processing unit (GPU) enabled faster and more efficient information processing (a topic for another post).

The second challenge was solved with a math concept known as the sparse matrix. In order to understand what a sparse matrix is we need to understand what a matrix is.

Enter the Matrix…

Simply put, a matrix in mathematics is a rectangular arrangement of numbers in rows and columns (think Excel spreadsheet). A sparse matrix is a matrix in which most of the entries are zero.

SPARSE MATRIX
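To make the idea concrete, here is a minimal sketch in plain Python. The values in the matrix are made up for illustration, and storing only the non-zero entries in a dictionary is just one simple way to represent a sparse matrix, not the only one.

```python
# A small matrix in which most entries are zero (illustrative values only).
dense = [
    [0, 0, 3, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 2],
]

# Count how many of the entries are zero.
total = sum(len(row) for row in dense)
zeros = sum(value == 0 for row in dense for value in row)

# Store only the non-zero entries, keyed by (row, column).
sparse = {
    (r, c): value
    for r, row in enumerate(dense)
    for c, value in enumerate(row)
    if value != 0
}

print(f"{zeros} of {total} entries are zero")
print(sparse)
```

Because only three entries are kept instead of twenty, this kind of representation saves memory when the matrix is mostly zeros.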

Now someone, somewhere, sometime, made a connection between text and the sparse matrix. The idea was that if we could somehow convert text into the format above then a computer would be able to analyze it and spit out insight (beautiful isn’t it?). Now bear with me for a second while we go through the different steps of converting text to the form of the sparse matrix above.

Computers LOVE structure, and text, to say the least, isn’t structured. Our objective is to first transform the text into a more structured format, then convert the structured text into a sparse matrix.

Consider the following sentences:

  1. Mary, is hungry for apples.
  2. John is happy he is not hungry for apples.

Step 1: Get rid of all punctuation marks.

  • > Mary is hungry for apples
  • > John is happy he is not hungry for apples
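A minimal sketch of this step in Python, assuming that "punctuation" means the ASCII punctuation characters listed in the standard library's `string.punctuation`:

```python
import string

sentences = [
    "Mary, is hungry for apples.",
    "John is happy he is not hungry for apples.",
]

# Translation table that deletes every ASCII punctuation character.
strip_punctuation = str.maketrans("", "", string.punctuation)

no_punct = [s.translate(strip_punctuation) for s in sentences]
for s in no_punct:
    print(s)
```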

Step 2: Make all letters lowercase. Computers distinguish between uppercase and lowercase and as such will treat ‘A’ and ‘a’ differently.

  • > mary is hungry for apples
  • > john is happy he is not hungry for apples
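In Python, one way to sketch this step is with the built-in `str.lower` method. The first line simply shows that the two cases really are treated as different characters:

```python
# Uppercase and lowercase letters are different characters to a computer.
print("A" == "a")  # False

sentences = [
    "Mary is hungry for apples",
    "John is happy he is not hungry for apples",
]

# Lowercase every sentence so 'Mary' and 'mary' count as the same word.
lowered = [s.lower() for s in sentences]
for s in lowered:
    print(s)
```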

Step 3: Separate each word in each sentence so that it becomes its own entity. This process is known as tokenization, and the resulting individual words are known as tokens.

  • > [mary], [is], [hungry], [for], [apples]
  • > [john], [is], [happy], [he], [is], [not], [hungry], [for], [apples]
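A minimal sketch of tokenization, assuming that splitting on whitespace is good enough for these cleaned-up sentences (real tokenizers handle many more cases):

```python
sentences = [
    "mary is hungry for apples",
    "john is happy he is not hungry for apples",
]

# str.split() with no arguments splits on any run of whitespace.
tokenized = [s.split() for s in sentences]
for tokens in tokenized:
    print(tokens)
```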

Step 4: Get all the unique words in both sentences and assign a particular position (index) to them.

  • > [mary], [is], [hungry], [for], [apples], [john], [happy], [he], [not]
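One way to sketch this step is to walk through the tokens in order and give each new word the next free index. The name `vocabulary` is just an illustrative choice:

```python
tokenized = [
    ["mary", "is", "hungry", "for", "apples"],
    ["john", "is", "happy", "he", "is", "not", "hungry", "for", "apples"],
]

# Map each unique word to an index, in order of first appearance.
vocabulary = {}
for tokens in tokenized:
    for token in tokens:
        if token not in vocabulary:
            vocabulary[token] = len(vocabulary)

print(vocabulary)
```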

Step 5: Convert each sentence to numbers. For each unique word from step 4, record the number of times that word occurs in the sentence (zero if it does not occur at all). This process is known as count vectorization.

Picture by Emmanuel Ameisen of AI Insight
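Here is a minimal sketch of count vectorization for the two example sentences, using only the standard library. The helper name `count_vector` is hypothetical, chosen for illustration:

```python
from collections import Counter

tokenized = [
    ["mary", "is", "hungry", "for", "apples"],
    ["john", "is", "happy", "he", "is", "not", "hungry", "for", "apples"],
]

# Unique words from step 4; a word's position in the list is its index.
vocabulary = ["mary", "is", "hungry", "for", "apples", "john", "happy", "he", "not"]


def count_vector(tokens, vocabulary):
    """Return one count per vocabulary word for a single sentence."""
    counts = Counter(tokens)  # a Counter returns 0 for missing words
    return [counts[word] for word in vocabulary]


# One row per sentence, one column per vocabulary word.
matrix = [count_vector(tokens, vocabulary) for tokens in tokenized]
for row in matrix:
    print(row)
```

With only two short sentences the rows are not very sparse. With a realistic vocabulary of tens of thousands of words, each sentence uses only a handful of them, so almost every entry in a row is zero. Libraries such as scikit-learn offer this step ready-made (its `CountVectorizer` class), though their default tokenization rules may differ from the steps in this post.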

Voila!! The resulting row of numbers for each sentence is the numerical representation of that sentence. If you then stack the rows of numbers on top of each other you get your sparse matrix, and a computer can go on to carry out whatever analysis you have in mind. A few examples of such analysis are:

  1. Sentiment Analysis: Identifying the opinions of people and groups through any media (text in this case).
  2. Topic Modeling: Identifying the core topics of a document through text analysis.
  3. Text Summarization: Summarizing large texts for quick digestion.
  4. Social Media Analysis: Analyzing social media posts and discussions to identify conversation trends and the sentiments behind them.

Computers are able to carry out numerical operations on our sparse matrix and convert the results back to words.
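As a small illustration of going from numbers back to words, the sketch below takes the count row for the second example sentence and looks up which words its non-zero entries stand for. The variable names are illustrative assumptions:

```python
vocabulary = ["mary", "is", "hungry", "for", "apples", "john", "happy", "he", "not"]

# Count row for "john is happy he is not hungry for apples".
row = [0, 2, 1, 1, 1, 1, 1, 1, 1]

# Pair each count with its word and keep only the words that occur.
word_counts = {word: count for word, count in zip(vocabulary, row) if count > 0}

# A simple numerical operation: find the most frequent word.
most_frequent = max(word_counts, key=word_counts.get)

print(word_counts)
print(most_frequent)
```

Note that the counts do not record word order, so the original sentence cannot be rebuilt exactly from its row alone.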

In conclusion, note that there are many other ways to carry out step 5 (e.g. tf–idf and the hashing vectorizer), but the count vectorizer is the most intuitive way to grasp the concept of transforming text into numerical form for analysis. The next time you hear about an application that reads novels (ScriptBook), watches videos (Dartfish), or listens and speaks to people (Alexa), know that these end products all begin with vectorizing non-numeric data.
