My time at Quizlet as a Data Science, Machine Learning Intern

Ridwan Olawin
Published in Tech @ Quizlet
5 min read · Oct 6, 2022

I have always been curious about data science, especially how it can be applied to real users. My education, from studying Software Engineering as an undergraduate at Drexel University to pursuing my master’s in Data Science at Columbia University, has given me a platform to learn how data can improve the user experience of a new software feature. My experience at Quizlet during the summer of 2022 showed me the principal role that data can play in influencing strategic decisions for a business.

I received my internship offer from Quizlet as a Code2040 fellow. I started my internship in the first week of June and wrapped up around the second week of August. As is typical of a data science internship, I was matched with a Machine Learning (ML) team, assigned a mentor, and tasked with building a well-studied kind of ML system: a recommendation model.

Project: User to School recommendation model

The objective of my project was just what the title says: predicting the school a Quizlet user belongs to. The motivation was that the study sets Quizlet recommends to a user depend heavily on finding content associated with the user’s school. Since only about 40% of United States (US) users enter their school explicitly, personalizing content for users without a school association becomes challenging.

Like a typical recommendation model, my first step was to generate candidates whose features I could use to make a potential school association. Using SQL, I selected users who had recently been active on Quizlet (last seen in the past 6 months) and looked at all the study sets they had studied in the past 5 years (see figure 1).
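The selection above was done in SQL; as a rough illustration only, the same filter could be sketched in Python. The record fields (`user_id`, `last_seen`, `set_id`, `studied_on`) and the exact day counts used for "6 months" and "5 years" are assumptions made for this example, not Quizlet's actual schema or cutoffs.

```python
from datetime import date, timedelta

# Assumed window sizes: "6 months" and "5 years" approximated in days.
ACTIVE_WINDOW = timedelta(days=183)
HISTORY_WINDOW = timedelta(days=5 * 365)


def select_candidates(users, study_events, today):
    """Return {user_id: [set_id, ...]} for recently active users.

    users: iterable of dicts with hypothetical keys "user_id", "last_seen".
    study_events: iterable of dicts with hypothetical keys
        "user_id", "set_id", "studied_on".
    """
    active_cutoff = today - ACTIVE_WINDOW
    history_cutoff = today - HISTORY_WINDOW

    # Keep only users seen within the activity window.
    active = {u["user_id"] for u in users if u["last_seen"] >= active_cutoff}

    # Collect the sets those users studied within the history window.
    candidates = {user_id: [] for user_id in active}
    for event in study_events:
        if event["user_id"] in active and event["studied_on"] >= history_cutoff:
            candidates[event["user_id"]].append(event["set_id"])
    return candidates
```

A user last seen more than six months ago is dropped entirely, and for the remaining users only study activity from the last five years is kept.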

Figure 1

One of the reasons for limiting the candidates to users from the last 6 months was that data points from 2020 and 2021 had a high degree of variability during feature selection, due to users switching from in-person to virtual education. An example of one of those features was the distance between a user and a potential school association.

Many sets on Quizlet have one or more school associations, from which I generated features linking a Quizlet user to a school. I used sets as my base approach to generating school associations for candidates because a user’s true school appears among the sets the user has studied about 45% of the time. I also tried other approaches to candidate generation; evaluations of many of them can be found below (see figure 2).
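One way to measure how often a user's true school shows up among set-based candidates is a simple recall calculation over users whose school is known. The sketch below assumes three hypothetical lookups (`user_sets`, `set_to_schools`, `true_school`); it illustrates the kind of evaluation described, not the code used in the project.

```python
def candidate_schools(studied_sets, set_to_schools):
    """Union of all schools associated with the sets a user has studied."""
    schools = set()
    for set_id in studied_sets:
        schools.update(set_to_schools.get(set_id, ()))
    return schools


def candidate_recall(user_sets, set_to_schools, true_school):
    """Fraction of labeled users whose true school is among their candidates.

    user_sets: {user_id: [set_id, ...]}            (hypothetical)
    set_to_schools: {set_id: [school_id, ...]}     (hypothetical)
    true_school: {user_id: school_id} for users who entered a school
    """
    if not true_school:
        return 0.0
    hits = 0
    for user_id, school in true_school.items():
        candidates = candidate_schools(user_sets.get(user_id, []), set_to_schools)
        if school in candidates:
            hits += 1
    return hits / len(true_school)
```

Comparing this number across candidate-generation approaches gives a quick read on which one surfaces the true school most often.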

Figure 2

Next, I generated features for my model. I experimented with many and chose the following:

  • Distance between a user and a potential school
  • Binarized whether the potential school was a university or high school
  • One-hot encoded age group (e.g., 15–20, 21–45)
  • Percentage of sets studied at a potential school per user
  • A decay on sets, computed as the reciprocal sum of the year difference (from the current year): the objective here is to give more value to sets that have been studied recently.
  • Binarized whether the user’s email domain matches the school’s website domain
  • Binarized whether the user’s current web locale country matches the school’s country
  • Target label: if the user’s true school matches a potential school
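A few of the features above can be sketched in Python to make them concrete. This is an illustration under stated assumptions: the function names and input shapes are hypothetical, the decay is read as a sum of reciprocals with 1 added to the year difference so that sets studied in the current year do not divide by zero, and the domain match accepts subdomains. None of these details are confirmed by the project itself.

```python
from urllib.parse import urlparse


def pct_sets_at_school(set_schools, school):
    """Share of a user's studied sets that are associated with `school`.

    set_schools: list with one collection of school ids per studied set.
    """
    if not set_schools:
        return 0.0
    return sum(school in schools for schools in set_schools) / len(set_schools)


def recency_score(study_years, current_year):
    """Sum of reciprocal year differences; recent sets contribute more.

    Assumption: +1 in the denominator avoids division by zero for sets
    studied in the current year.
    """
    return sum(1.0 / (current_year - year + 1) for year in study_years)


def email_matches_school(email, school_website):
    """1 if the email domain equals (or is a subdomain of) the school's host.

    Assumption: `school_website` is a full URL including its scheme.
    """
    email_domain = email.rsplit("@", 1)[-1].lower()
    host = urlparse(school_website).netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    if not host:
        return 0
    return int(email_domain == host or email_domain.endswith("." + host))
```

With this reading of the decay, a set studied this year contributes 1, one studied last year contributes 1/2, and one studied three years ago contributes 1/4.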

Choosing these features allowed me to generate training data, to which I then applied several machine learning algorithms. Given the way the candidates were chosen, some users had many potential school associations, both because some users study many sets and because a study set can be associated with multiple schools. Had I left the training data as is, most machine learning algorithms would have been computationally expensive to run on it, so I had to limit my data in some way. I did this by ranking each user’s schools by the percentage of sets studied at that school and keeping the top three. In addition, I applied undersampling because otherwise we would have had more labels of zero than one. This resulted in about 7.5 million total candidates.

Since the data had been framed as a binary classification problem, I tried binary classification algorithms such as logistic regression, boosted trees, a deep neural network (DNN), and AutoML. The boosted tree had the highest accuracy (83%) on the validation data set and was chosen as the final model.
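The top-three filter and the undersampling step described above could look roughly like the following. The row keys (`user_id`, `pct_sets`, `label`) are hypothetical, and balancing negatives to a 1:1 ratio with positives is an assumption; the actual sampling ratio is not stated.

```python
import random
from collections import defaultdict


def top_k_per_user(rows, k=3):
    """Keep each user's k candidate schools with the highest pct_sets."""
    by_user = defaultdict(list)
    for row in rows:
        by_user[row["user_id"]].append(row)

    kept = []
    for user_rows in by_user.values():
        ranked = sorted(user_rows, key=lambda r: r["pct_sets"], reverse=True)
        kept.extend(ranked[:k])
    return kept


def undersample_negatives(rows, seed=0):
    """Randomly drop label-0 rows until they do not outnumber label-1 rows.

    Assumption: a 1:1 class balance; a fixed seed keeps runs repeatable.
    """
    rng = random.Random(seed)
    positives = [r for r in rows if r["label"] == 1]
    negatives = [r for r in rows if r["label"] == 0]
    if len(negatives) > len(positives):
        negatives = rng.sample(negatives, len(positives))
    balanced = positives + negatives
    rng.shuffle(balanced)
    return balanced
```

Applying `top_k_per_user` first bounds the number of rows per user, and `undersample_negatives` then evens out the classes before the rows are handed to a classifier.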

The model was able to predict a school, with the confidence we desired, for about 57% of users without one, well beyond our goal of at least 10%. I pushed the model into production and set up monthly retraining to keep predictions fresh and accurate.
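A coverage figure like the one above can be computed by keeping, for each user, the highest-scoring candidate school and counting only predictions that clear a confidence threshold. The sketch below is a hedged illustration: the `scores` structure is hypothetical and the threshold is a free parameter, since the actual confidence cutoff is not given.

```python
def assign_schools(scores, threshold):
    """Pick each user's top-scoring school if it clears the threshold.

    scores: {user_id: {school_id: predicted_probability}}  (hypothetical)
    """
    assigned = {}
    for user_id, by_school in scores.items():
        if not by_school:
            continue  # no candidate schools for this user
        school, prob = max(by_school.items(), key=lambda item: item[1])
        if prob >= threshold:
            assigned[user_id] = school
    return assigned


def coverage(scores, threshold):
    """Fraction of users who receive a confident school prediction."""
    if not scores:
        return 0.0
    return len(assign_schools(scores, threshold)) / len(scores)
```

Raising the threshold trades coverage for precision, so the cutoff would be chosen to match whatever confidence level the product requires.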

Future improvements to the model include:

  • Implementing a real-time service for a zero-query search state to allow users to add their schools easily
  • Continuing to improve the accuracy of the model by exploring the interaction between users and their classes

Key Takeaways

One of Quizlet’s values is “teach yourself something new,” and that is the value I upheld throughout my time there and will continue to uphold in my career. Working on this project exposed me to the machine learning cycle of building a product from scratch and moving it into production. It also exposed me to the different ways data can be modeled and to the most difficult aspect of many machine learning projects: conceptualizing features on which to apply an algorithm.

Towards the end of my internship, I also participated in an intern hack week, a full week dedicated to building anything that could enhance the Quizlet experience. I worked in a group of four interns to build features into the Quizlet widget. The features let users keep track of their study progress, such as streaks of days on which they interacted with Quizlet, or simply jump back into the sets they are currently studying. Partnering with a brand design intern and a mobile developer intern allowed me to learn more about developing a prototype in Figma and building features into a mobile widget from scratch.

Thank you, Quizlet!

Thank you to everyone, especially Maddy Gilbert (my mentor), for mentoring and advocating for me. I was able to share my passion for enhancing the education tech space and to learn about the possibilities of data science with the machine learning team. I know the skills I developed this summer will help me become a better data scientist as I move into a full-time role next year.
