Order from Chaos: Understanding Search Queries through Vectors

The Query2Vec pipeline and a foray into word embeddings

Myra Cheng
Coursera Engineering
Sep 13, 2019 · 6 min read


How do learners browse the site as a result of different searches? What topics are they looking for that we don’t currently offer? By examining search queries and their patterns, we can gauge learners’ interests and improve the site experience.

Sifting through queries to understand how our content is discovered, I quickly realized how difficult the task is: Coursera receives millions of searches every day, far too many to analyze one by one.

This led me down a rabbit hole of learning about word embeddings and building a pipeline to turn search queries into vectors that represent their relationships to one another. In this representation, for example, the terms “machine learning” and “artificial intelligence” are closer together than “machine learning” and “nutrition.”

Transforming open-ended data into quantitative embeddings opens up a whole world of usage. I focused on identifying unmet demand in our course catalog, but other applications include investigating differences in broad vs. specific searches or evaluating which universities learners prefer. These vectors are also valuable as signals in prediction and classification models.

How Query2Vec Works

The Query2Vec algorithm pipeline. In this case, the features are a dataset of the skills that learners can gain from our courses.

Preprocessing

The raw queries contain all sorts of inconsistencies:

A few of the queries related to “deep learning.”

We could scan every query for the phrase “deep learning,” but that would exclude some of the more idiosyncratic (or simply typo-ridden) queries — and this is a tiny fraction of the diversity of free-text queries. I resolved this issue using natural language processing (NLP) techniques like lowercasing letters and discarding extraneous characters. Stopwords, common words like “a” or “the,” are removed. Each phrase is then converted into an array of words for use in the rest of the pipeline.

Queries after preprocessing.
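To make those steps concrete, here is a minimal Python sketch of this kind of preprocessing. The stopword list and cleanup rules below are illustrative placeholders, not the exact ones used in the pipeline.

```python
import re

# A small illustrative stopword list; the production pipeline's list
# isn't specified in this post, so treat these entries as placeholders.
STOPWORDS = {"a", "an", "the", "of", "for", "to", "in", "and"}

def preprocess(query: str):
    """Lowercase, strip extraneous characters, drop stopwords, and tokenize."""
    query = query.lower()
    # Keep letters, digits, and whitespace; discard punctuation and symbols.
    query = re.sub(r"[^a-z0-9\s]", " ", query)
    return [token for token in query.split() if token not in STOPWORDS]

print(preprocess("Deep-Learning!! for the beginner"))
# ['deep', 'learning', 'beginner']
```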

Vectorizing

The pipeline relies on the concepts of Google’s Word2Vec, in which the model learns how to turn each word into a vector that retains its logical, grammatical, and linguistic relationships. It deduces these relationships from a corpus with billions of words, such as the entirety of Wikipedia.

The algorithm starts with randomly initialized vectors and trains them in an unsupervised fashion using cosine similarity (the cosine of the angle between two vectors). Since cos(90°) is 0 while cos(0°) is 1, orthogonal vectors are treated as unrelated, while vectors pointing in the same direction are maximally similar.
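As a quick illustration of the similarity measure itself (not the training code), cosine similarity can be computed directly with NumPy:

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between u and v: 1 for identical directions, 0 for orthogonal."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
print(cosine_similarity(a, a))  # 1.0 (same direction)
print(cosine_similarity(a, b))  # 0.0 (orthogonal)
```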

With every sentence in the training set, the model adjusts the vectors for the words based on their contexts, nudging them slightly closer together if they appear together and farther apart if not. The algorithm also captures different types of similarities, shifting the vectors so that, for example, “doctor” is to “doctors” as “engineer” is to “engineers,” while the connections that “doctor” has to “hospital” and “nurse” are preserved. Gradual changes over a large dataset transform the random vectors into ones that embed the complex relationships between the words.
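As an aside, this analogy behavior is easy to reproduce with gensim’s pretrained Google News Word2Vec vectors. The model name below comes from the gensim-data catalogue; it is a demonstration, not part of the Query2Vec pipeline.

```python
import gensim.downloader as api

# Pretrained Google News Word2Vec vectors from the gensim-data catalogue
# (a large download); any pretrained embedding would illustrate the same idea.
vectors = api.load("word2vec-google-news-300")

# "doctor" is to "doctors" as "engineer" is to ... ?
print(vectors.most_similar(positive=["doctors", "engineer"], negative=["doctor"], topn=3))

# The other connections survive too: "doctor" stays close to "hospital" and "nurse".
print(vectors.similarity("doctor", "hospital"), vectors.similarity("doctor", "nurse"))
```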

Another impressive feature of this approach is its low dimensionality, i.e. the ability to compress the differences among millions of words into a few hundred dimensions.

This is merely the tip of the iceberg of word embeddings, which is an entire field of study! I decided to use the open-source FastText model created by Facebook Research to generate the vectors, since it is the most flexible in handling unknown words and phrases. This model breaks each word (a sequence of characters) down into its character n-grams, so even for words outside the model’s vocabulary, we can still construct a vector from the character n-grams. This makes it more versatile than the original Word2Vec, where the smallest unit is a single word.
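Here is a minimal sketch of that out-of-vocabulary behavior using gensim’s implementation of the FastText algorithm; the toy corpus and hyperparameters are placeholders, and whether the production pipeline uses gensim or Facebook’s own fasttext package is not something this post specifies.

```python
from gensim.models import FastText

# Toy corpus of tokenized queries; the real pipeline trains on far more data.
corpus = [
    ["deep", "learning"],
    ["machine", "learning", "introduction"],
    ["learn", "python", "programming"],
]

# Character n-grams between min_n and max_n let the model build vectors
# for words it has never seen, such as typos of "learning".
model = FastText(vector_size=100, window=3, min_count=1, min_n=3, max_n=5)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=10)

vec = model.wv["leearning"]  # out-of-vocabulary typo still gets a vector
print(vec.shape)             # (100,)
```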

For queries that are phrases, I assign each one a vector that is the normalized mean of its word vectors. This is inspired by what FastText does in the source code for its “most similar search” capability.
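A sketch of that phrase-to-vector step is below, assuming the word vectors come from a FastText-style model; the exact normalization convention (unit-normalizing each word vector before averaging) is an assumption paraphrased from FastText’s behavior rather than copied from the pipeline.

```python
import numpy as np

def phrase_vector(words, word_vecs):
    """Average the L2-normalized vectors of a phrase's words.

    `word_vecs` is any mapping from word to vector, e.g. the `wv`
    attribute of the gensim FastText model sketched earlier.
    """
    normalized = []
    for word in words:
        vec = np.asarray(word_vecs[word], dtype=np.float32)
        norm = np.linalg.norm(vec)
        if norm > 0:
            normalized.append(vec / norm)
    return np.mean(normalized, axis=0) if normalized else None

query_vec = phrase_vector(["deep", "learning"], model.wv)  # reuses the model above
```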

The culmination of these years of research has turned our queries into vectors. At this point, they can serve as input features for various machine learning models or be analyzed further, as explained below.

T-distributed Stochastic Neighbor Embedding (t-SNE) of some sample query vectors. t-SNE is a popular method for visualizing high-dimensional vectors; it is also used to categorize our learning content. We can see clusters indicating relationships among words: “free” and “introduction”; all the terms “with r”; “arts” and “music”; “politics” and “governance.”
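For reference, a t-SNE projection like the one above can be produced with scikit-learn; the perplexity and initialization below are common defaults, not necessarily the settings used for this figure.

```python
import numpy as np
from sklearn.manifold import TSNE

# `query_vecs` would be an (n_queries, dim) array of query embeddings;
# random data stands in for it in this sketch.
query_vecs = np.random.rand(500, 300).astype(np.float32)

tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
coords = tsne.fit_transform(query_vecs)  # (500, 2) points, ready to scatter-plot
```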

Mapping to Features (Similarity Search)

We can map these query embeddings to any set of words based on their numerical similarities. Coursera has a database of thousands of skills, many of which are tied to course offerings. For example, the course Becoming a Changemaker: Introduction to Social Innovation by University of Cape Town teaches the skills “social entrepreneurship,” “sustainability,” and “resource management.” Besides connecting queries to this database of skills, this technique is widely extensible to other mappings, like with a list of different universities or words related to difficulty levels.

To compute the similarities between these two sets of vectors, the queries and the skills, I use Faiss, a C++ similarity-search library that stores the vectors in a large matrix and builds an index of clusters based on their similarity.

This enables a precise yet efficient calculation of a query’s “nearest neighbors” across the set of skills, i.e. the skills with the most similar embeddings, by searching only the most relevant clusters.
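A minimal Faiss sketch of this kind of cluster-based nearest-neighbor search follows. The index type (IndexIVFFlat), the cluster count, and the use of L2-normalized vectors with inner product as a stand-in for cosine similarity are illustrative choices, not necessarily those of the production pipeline.

```python
import numpy as np
import faiss

dim = 300
skill_vecs = np.random.rand(10_000, dim).astype("float32")  # stand-in for skill embeddings
query_vecs = np.random.rand(5, dim).astype("float32")       # stand-in for query embeddings

# Normalize so that inner product equals cosine similarity.
faiss.normalize_L2(skill_vecs)
faiss.normalize_L2(query_vecs)

# IVF index: cluster the skill vectors, then search only the nearest clusters.
nlist = 100                                   # number of clusters
quantizer = faiss.IndexFlatIP(dim)
index = faiss.IndexIVFFlat(quantizer, dim, nlist, faiss.METRIC_INNER_PRODUCT)
index.train(skill_vecs)
index.add(skill_vecs)
index.nprobe = 10                             # clusters to visit per query

similarities, neighbor_ids = index.search(query_vecs, 5)  # top-5 skills per query
```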

Sample results. The “query” column contains raw search queries, the “top features” are the most relevant skills, and the “distances” measure how far apart a query and a skill are on a 0-to-1 scale, where 0 means they are identical.
Sample results with a higher distance threshold, allowing less similar features to be tagged.

Analysis Possibilities

Once the queries are tagged with skills, they can be easily clustered and analyzed.

t-SNE on queries that have been mapped to skills in our database. They form four distinct clusters, representing the skills “digital media,” “agile software development,” “mechanical engineering,” and “creative writing.”

Let’s take a look at the skills whose associated queries lead to the highest rates of course enrollment.

Enrollment rates for popular skills (popularity determined by Query2Vec).
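As a rough sketch of how such an enrollment-rate comparison might be computed, assuming a hypothetical table of tagged queries; the column names and numbers below are invented for illustration and are not Coursera’s actual schema or data.

```python
import pandas as pd

# Hypothetical tagged-query table; columns and values are placeholders.
tagged = pd.DataFrame({
    "skill":       ["psychology", "psychology", "machine learning", "nutrition"],
    "searches":    [1200, 800, 950, 400],
    "enrollments": [60, 35, 190, 48],
})

by_skill = tagged.groupby("skill")[["searches", "enrollments"]].sum()
by_skill["enrollment_rate"] = by_skill["enrollments"] / by_skill["searches"]
print(by_skill.sort_values("searches", ascending=False))
```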

Why does psychology enjoy so much attention from searchers yet have such a low enrollment rate? The top results from a search for “psychology” on Coursera are introductory courses and positive psychology courses. But psychology is a broad topic — could searchers want to learn the biological or computational aspects of the field? This requires further investigation of the data behind the queries tagged with “psychology.” This type of analysis clues us into the unmet needs of learners, enabling us to develop more relevant content to help learners achieve their goals.

These methods and statistics provide a fresh perspective into what learners are searching for, which has applications in understanding unmet demand. Developing new course content, improving course branding, and fine-tuning our recommendation systems are only a few of the possibilities. I also set up scripts to vectorize, tag, and analyze search queries daily for use in our rankings and recommendations models.

Although my exploration is in English for easy sanity checking, the Query2Vec pipeline can be used in other languages and limitless other contexts.

Acknowledgments

This project wouldn’t have been possible without Jaya Chavern’s mentorship and Allie Rogers’s search-related insights. Thank you to the many people across the Data Science and Engineering teams whose support and excitement gave my work momentum and purpose.

About the Author

Myra is an undergraduate student at Caltech. She was an intern at Coursera in the summer of 2019. Her curiosity about how the company makes data-driven decisions inspired this independent project, which she designed, explored, and built during the final three weeks of her internship.

If you’re interested in applying data science or engineering to advance the future of education, check out our open roles here.

