Order from Chaos: Understanding Search Queries through Vectors

The Query2Vec pipeline and a foray into word embeddings

Myra Cheng
Sep 13, 2019 · 6 min read

How do learners browse the site after running different searches? What topics are they looking for that we currently don’t provide? By examining search queries and their patterns, we can gauge learners’ interests and improve the site experience.

Sifting through queries to understand how our content is discovered, I quickly realized the difficulty of this task. Coursera gets millions of searches every day, so it’s hard to directly analyze them.

This led me down a rabbit hole of learning about word embeddings and building a pipeline to turn search queries into vectors that represent their relationships to one another. In this representation, for example, the terms “machine learning” and “artificial intelligence” are closer together than “machine learning” and “nutrition.”

Transforming open-ended data into quantitative embeddings opens up a whole world of usage. I focused on identifying unmet demand in our course catalog, but other applications include investigating differences in broad vs. specific searches or evaluating which universities learners prefer. These vectors are also valuable as signals in prediction and classification models.

How Query2Vec Works


Preprocessing

The raw queries contain all sorts of inconsistencies: the same intent shows up with different capitalization, punctuation, and typos.


We could scan every query for the phrase “deep learning,” but that would exclude some of the more idiosyncratic (or simply typo-ridden) queries, and “deep learning” is only a tiny fraction of the diversity of free-text queries. I resolved this issue using natural language processing (NLP) techniques: lowercasing letters, discarding extraneous characters, and removing stopwords (common words like “a” or “the”). Each phrase is then converted into an array of words for use in the rest of the pipeline.
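As an illustration, a minimal version of this preprocessing might look like the sketch below; the stopword list and character filter are simplified placeholders, not Coursera’s production code.

```python
import re

# Simplified stopword list for illustration; a real pipeline would use a fuller set.
STOPWORDS = {"a", "an", "the", "of", "for", "to", "in", "and"}

def preprocess(query: str) -> list:
    """Lowercase, strip extraneous characters, drop stopwords, and tokenize."""
    query = query.lower()
    query = re.sub(r"[^a-z0-9\s]", " ", query)  # keep only letters, digits, and spaces
    return [token for token in query.split() if token not in STOPWORDS]

print(preprocess("Deep-Learning for Beginners!!"))  # ['deep', 'learning', 'beginners']
```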


Vectorizing

The pipeline relies on the concepts of Google’s Word2Vec, in which the model learns how to turn each word into a vector that retains its logical, grammatical, and linguistic relationships. It deduces these relationships from a corpus with billions of words, such as the entirety of Wikipedia.

The algorithm starts with randomly initialized vectors and trains them without supervision, measuring similarity as cosine similarity (the cosine of the angle between two vectors). Since cos(90º) is 0 while cos(0º) is 1, orthogonal vectors share no similarity, while vectors pointing in the same direction are maximally similar.
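To make the measure concrete, here is a small sketch (illustrative, not part of the pipeline itself) of computing cosine similarity with NumPy:

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """cos(theta) = (u . v) / (||u|| * ||v||)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # 0.0, orthogonal
print(cosine_similarity(np.array([1.0, 2.0]), np.array([2.0, 4.0])))  # 1.0, same direction
```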

With every sentence in the training set, the model adjusts the vectors of the words based on their contexts, nudging them slightly closer together if they appear together and farther apart if they do not. The algorithm also captures different types of similarity, shifting the vectors so that, for example, “doctor” is to “doctors” as “engineer” is to “engineers,” while preserving the connections that “doctor” has to “hospital” and “nurse.” Gradual changes over a large dataset transform the random vectors into ones that embed the complex relationships between the words.
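These analogy relationships can be probed directly with vector arithmetic. A sketch using gensim’s KeyedVectors follows; the vector file name is a placeholder for whichever pre-trained embeddings you load.

```python
from gensim.models import KeyedVectors

# Placeholder path: any pre-trained word2vec-format embeddings would work here.
wv = KeyedVectors.load_word2vec_format("pretrained_vectors.bin", binary=True)

# "doctor" is to "doctors" as "engineer" is to ... ?
# vec("doctors") - vec("doctor") + vec("engineer") should land near vec("engineers").
print(wv.most_similar(positive=["doctors", "engineer"], negative=["doctor"], topn=3))
```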

Another impressive feature of this approach is its low dimensionality, i.e. the ability to compress the differences among millions of words into a few hundred dimensions.

This is merely the tip of the iceberg of word embeddings, which is an entire field of study! I decided to use the open-source FastText model created by Facebook Research to generate the vectors, since it is the most flexible at handling unknown words and phrases. FastText breaks each word, itself a sequence of characters, down into its character n-grams. Thus, for a word outside the model’s vocabulary, we can still construct a vector from its character n-grams. This makes it more versatile than the original Word2Vec, where the smallest unit is a single word.
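A toy sketch of this behavior, using gensim’s FastText implementation; the corpus and hyperparameters are purely illustrative, and the real pipeline would rely on a model trained on (or pre-trained with) a far larger corpus.

```python
from gensim.models import FastText

# Tiny illustrative corpus; real training data would be billions of words.
sentences = [
    ["machine", "learning", "course"],
    ["deep", "learning", "with", "python"],
    ["introduction", "to", "nutrition"],
]
model = FastText(sentences, vector_size=100, window=3, min_count=1, epochs=10)

# Even a misspelled, out-of-vocabulary word gets a vector,
# assembled from the character n-grams it shares with known words.
vec = model.wv["machne"]  # note the typo
print(vec.shape)          # (100,)
```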

For queries that are phrases, I assigned each one a vector that is the normalized mean of its word vectors. This is inspired by how FastText composes a phrase vector in the source code of its “most similar” search capability.
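Interpreting “normalized mean” as averaging the L2-normalized word vectors (which is how FastText builds a sentence vector for nearest-neighbor search), a phrase vector can be computed like this; the word vectors below are made up purely to show the mechanics.

```python
import numpy as np

def phrase_vector(words, word_vectors):
    """Average the L2-normalized vectors of a phrase's words."""
    normed = [word_vectors[w] / np.linalg.norm(word_vectors[w]) for w in words]
    return np.mean(normed, axis=0)

# Made-up 3-dimensional word vectors, just for illustration.
word_vectors = {
    "machine": np.array([0.9, 0.1, 0.0]),
    "learning": np.array([0.2, 0.8, 0.1]),
}
print(phrase_vector(["machine", "learning"], word_vectors))
```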

Standing on this body of research, we can now turn our queries into vectors. At this point, they are useful as input features for various machine learning models, or they can be analyzed further, as explained below.


Mapping to Features (Similarity Search)

We can map these query embeddings to any set of words based on their numerical similarities. Coursera has a database of thousands of skills, many of which are tied to course offerings. For example, the course Becoming a Changemaker: Introduction to Social Innovation by the University of Cape Town teaches the skills “social entrepreneurship,” “sustainability,” and “resource management.” Besides connecting queries to this database of skills, the technique extends readily to other mappings, such as a list of universities or words related to difficulty levels.

To compute the similarities between these two sets of vectors, the queries and the skills, I use Faiss, a C++ similarity-search library that stores the vectors in one large matrix and builds an index of clusters based on their similarity.

This enables a precise yet efficient calculation of a query’s “nearest neighbors” across the set of skills, i.e. skills with the most similar embeddings, by focusing on clusters.
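A sketch of what this nearest-neighbor search could look like with Faiss; the dimensionality, cluster count, and random vectors are placeholders, and in the real pipeline the skill and query matrices come from the embedding step above.

```python
import faiss
import numpy as np

d = 300                                                  # embedding dimensionality
skill_vecs = np.random.rand(10000, d).astype("float32")  # placeholder skill embeddings
query_vecs = np.random.rand(5, d).astype("float32")      # placeholder query embeddings

# Normalize so inner product equals cosine similarity.
faiss.normalize_L2(skill_vecs)
faiss.normalize_L2(query_vecs)

# IVF index: cluster the skill vectors, then search only the closest clusters.
nlist = 100                                              # number of clusters (illustrative)
quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
index.train(skill_vecs)
index.add(skill_vecs)
index.nprobe = 10                                        # clusters to scan per query

scores, skill_ids = index.search(query_vecs, 5)          # 5 nearest skills per query
print(skill_ids)
```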


Analysis Possibilities

Once the queries are tagged with skills, they can be easily clustered and analyzed.
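For instance, once each query carries a skill tag, a simple pandas aggregation yields per-skill search volume and enrollment rate; the dataframe and column names here are hypothetical.

```python
import pandas as pd

# Hypothetical data: one row per search query, tagged with its nearest skill.
df = pd.DataFrame({
    "skill":    ["psychology", "psychology", "machine learning", "nutrition"],
    "enrolled": [0, 0, 1, 1],   # did the search lead to a course enrollment?
})

summary = (
    df.groupby("skill")
      .agg(searches=("enrolled", "size"), enrollment_rate=("enrolled", "mean"))
      .sort_values("searches", ascending=False)
)
print(summary)
```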


Let’s take a look at how often the queries tagged with each skill lead to course enrollment.


Why does psychology enjoy so much attention from searchers yet have such a low enrollment rate? The top results from a search for “psychology” on Coursera are introductory courses and positive psychology courses. But psychology is a broad topic — could searchers want to learn the biological or computational aspects of the field? This requires further investigation of the data behind the queries tagged with “psychology.” This type of analysis clues us into the unmet needs of learners, enabling us to develop more relevant content to help learners achieve their goals.

These methods and statistics provide a fresh perspective into what learners are searching for, which has applications in understanding unmet demand. Developing new course content, improving course branding, and fine-tuning our recommendation systems are only a few of the possibilities. I also set up scripts to vectorize, tag, and analyze search queries daily for use in our rankings and recommendations models.

Although my exploration is in English for easy sanity checking, the Query2Vec pipeline can be used in other languages and limitless other contexts.

Acknowledgments

This project wouldn’t have been possible without Jaya Chavern’s mentorship and Allie Rogers’s search-related insights. Thank you to the many people across the Data Science and Engineering teams whose support and excitement gave my work momentum and purpose.

About the Author

Myra is an undergraduate student at Caltech. She was an intern at Coursera in the summer of 2019. Her curiosity about how the company makes data-driven decisions inspired this independent project, which she designed, explored, and built during the final three weeks of her internship.

If you’re interested in applying data science or engineering to advance the future of education, check out our open roles here.

