Spark-based machine learning tools for capturing word meanings

It is always amazing when someone is able to take a very hard, present-day problem and translate it into one that has been studied for centuries. This is the case with Word2Vec, which transforms words into vectors. Text is unstructured data and has been explored mathematically far less than vectors, both historically and today. Newton (1642–1726) may have been the first to study vectors in the context of forces in physics, so vectors are a concept with at least 289 years of scientific maturity. Mathematical exploration of text data, by contrast, has only a few decades of maturity. Similarly, I have worked with vectors for more than half of my life, but have explored text data for less than a year.

The application of mathematical thinking to text data is especially important now, at a time when the value of data is understood, but not actualized. The majority of business-relevant information originates in unstructured form, primarily text. This data is invisible to, and unusable by, business, health care, education, and government, until it can be “read”. Mathematical exploration of text data can yield insights that translate into better decisions made by doctors, marketers, entrepreneurs, and teachers.

As part of my endeavor to make text data “readable”, I applied Word2Vec to generate vectors that capture word meaning and enable arithmetic operations on words. For example, vector(‘king’) + vector(‘woman’) − vector(‘man’) results in a vector that is close to vector(‘queen’). Isn’t this incredible? The Word2Vec method was proposed by Mikolov et al. in 2013. The algorithm is based on neural networks and maps a corpus of text to a matrix in which each row is associated with a word in the input text data (for example, tweets, product reviews, playlists, …). The resulting vector space can be used in a variety of ways, such as measuring the distance between words. Given a word of interest, it can therefore be used to compute the top N closest words.
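
To make the arithmetic concrete, here is a minimal sketch in plain Python. It is not the Spark ML implementation and it does not train anything: the three-dimensional vectors are hypothetical values that I hand-picked so the analogy works out, and the helper names `cosine_similarity` and `closest_words` are my own. I also assume cosine similarity as the measure of closeness, which is a common choice for word vectors.

```python
import math

# Hypothetical 3-dimensional embeddings, hand-picked for illustration only.
# A trained Word2Vec model learns vectors with many more dimensions.
EMBEDDINGS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.2, 0.8],
    "apple": [0.0, 0.1, 0.1],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def closest_words(target, embeddings, top_n=5, exclude=()):
    """Return the top_n (word, similarity) pairs most similar to target."""
    scored = [
        (word, cosine_similarity(target, vec))
        for word, vec in embeddings.items()
        if word not in exclude
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# vector('king') + vector('woman') - vector('man'), computed element-wise.
analogy = [
    k + w - m
    for k, w, m in zip(EMBEDDINGS["king"], EMBEDDINGS["woman"], EMBEDDINGS["man"])
]

if __name__ == "__main__":
    query_words = {"king", "woman", "man"}
    for word, score in closest_words(analogy, EMBEDDINGS, top_n=2, exclude=query_words):
        print(f"{word}: {score:.3f}")
```

With these toy values, ‘queen’ ranks first among the words that were not part of the query. The query words themselves are excluded, since they would otherwise tend to show up among their own nearest neighbors.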

For example, a model that I built using 30 days of Twitter data gives the following 5 closest words to #deeplearning: 1. #machinelearning, 2. #ml, 3. #smartdata, 4. #predictiveanalytics, 5. #datascience. The Word2Vec implementation I used is the one from Spark ML, one of the machine learning packages that are part of Apache Spark.

If you’re interested in building your own Word2Vec model, take a look at this repo, which runs on The Data Science Experience.

Originally published on August 26, 2016.