Building Glassdoor’s Machine Learning Platform and Engineering Team

Malathi Sankar
Glassdoor Engineering Blog
Aug 1, 2022

In this post, I would like to share how Glassdoor built and scaled its ML platform and engineering team and what we learned along the way.

Glassdoor's ML journey

Ten years ago, machine learning at Glassdoor started with a straightforward logistic regression model, which was used to power our job recommendations. Today machine learning is ubiquitous. It is used across most of our products — search, ads, reviews, salaries, interviews, benefits, and Fishbowl post recommendations, to name a few. Data quality, model innovations, hardware advancements, data infrastructure, resource investments, and tools propelled this growth.

Until 2021, our machine learning team consisted mostly of scientists who were experts in math and statistics. They worked with data engineers to prepare data and build optimized models to solve problems. Scientists handled batch deployments in Airflow, project engineers carried out the online model deployments, and infrastructure provisioning was left to the DevOps team.

This organizational structure was less than ideal, however, and created many challenges. Communication gaps and resource misallocations made deployments tedious and time-consuming. Models were hard to monitor, observe, maintain, and debug. Our machine learning technical debt grew rapidly and had a negative impact on Glassdoor's operations.

Adopting an MLOps framework

To deal with the above challenges, we conceptualized real-time machine learning as primarily an engineering and infrastructure problem. As Sculley et al. discuss in their paper [1], only a small fraction of a real-world ML system is composed of ML code (the small black box in Figure 1 of that paper), while the surrounding infrastructure is vast and complex.

By adopting a set of DevOps and engineering best practices known as MLOps, we were able to manage this complexity effectively. As previously discussed on our blog, our initial approach comprised a mix of open source tools, commercial software, and proprietary tools we built to orchestrate our ML pipeline. This was the genesis of Glassdoor's ML Platform. While building out this platform, we quickly ran into resource constraints: we needed more ML engineering talent and bandwidth.
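To make that concrete, here is a minimal sketch of what batch orchestration can look like in Airflow, the scheduler our scientists were already using for batch deployments. The DAG name, task names, and helper functions below are hypothetical placeholders for illustration, not our actual pipeline.

# A minimal Airflow sketch of batch-scoring orchestration. All names here are
# hypothetical placeholders; they do not describe Glassdoor's real pipeline.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_features(**context):
    """Pull the latest feature snapshot from the warehouse (placeholder)."""


def score_candidates(**context):
    """Load the trained model and score today's candidates (placeholder)."""


def publish_scores(**context):
    """Write scores to the store the serving layer reads from (placeholder)."""


with DAG(
    dag_id="batch_job_recommendations",  # hypothetical name
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_features", python_callable=extract_features)
    score = PythonOperator(task_id="score_candidates", python_callable=score_candidates)
    publish = PythonOperator(task_id="publish_scores", python_callable=publish_scores)

    extract >> score >> publish

Even a toy DAG like this makes the point: once each stage is an explicit task, retries, monitoring, and ownership boundaries become much easier to reason about.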

Scaling the ML organization

As the number of models grew and demand for ML applications in new domains increased, scaling and effective organization of ML took on greater urgency. To the three pre-existing functional teams — ML Product, ML Science, and ML Knowledge Engineering — we added another: ML Platform and Engineering. Here are the teams and their roles:

  1. ML Product — Responsible for product and business strategy. Scope ML problems and communicate needs to the ML teams. Explain impact to critical stakeholders outside the ML organization.
  2. ML Science — Understand problem scope, collect data, engineer features, train and evaluate models, and investigate changes to improve model performance.
  3. ML Knowledge Engineering — Manage data labeling, review and monitor model performance in production, and oversee offshore contractor labeler operations.
  4. ML Platform and Engineering — Build and manage the ML backend platform infrastructure that handles data collection, validation, model training, evaluation, testing, serving, deployment, A/B testing, model monitoring, governance, and telemetry. Responsible for building and adopting tools that allow for data visualization, model analysis, and data annotation (see the sketch after this list).
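To give a flavor of the serving side of that charter, here is a minimal, dependency-free sketch of request validation, deterministic A/B bucketing, and per-prediction telemetry. The feature schema, bucket split, and log format are all hypothetical; a real platform would back these with a feature store, a model registry, and a proper metrics pipeline.

# A dependency-free sketch of serving-side validation, A/B routing, and
# telemetry. Schema, split ratio, and log format are hypothetical examples.
import hashlib
import json
import logging
import time

logger = logging.getLogger("ml_platform.serving")

REQUIRED_FEATURES = {"user_id", "job_title", "location"}  # hypothetical schema


def validate(features: dict) -> None:
    """Fail fast on malformed requests before they reach a model."""
    missing = REQUIRED_FEATURES - features.keys()
    if missing:
        raise ValueError(f"missing features: {sorted(missing)}")


def ab_bucket(user_id: str, treatment_share: float = 0.1) -> str:
    """Deterministically assign a user to the control or treatment model."""
    digest = int(hashlib.sha256(user_id.encode()).hexdigest(), 16)
    return "treatment" if (digest % 1000) / 1000 < treatment_share else "control"


def predict(features: dict, models: dict) -> float:
    validate(features)
    bucket = ab_bucket(features["user_id"])

    start = time.perf_counter()
    score = models[bucket](features)  # each model is a stand-in callable here
    latency_ms = (time.perf_counter() - start) * 1000

    # Telemetry: one structured log line per prediction, ready for dashboards.
    logger.info(json.dumps({"bucket": bucket, "score": score,
                            "latency_ms": round(latency_ms, 2)}))
    return score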

Building ML Platform and Engineering

The ML Platform & Engineering team is responsible for our ML platform, infrastructure, and tooling. Unfortunately, we lacked applied machine learning engineers: engineers who are skilled not only in software engineering, system design, and infrastructure but also in applied machine learning algorithms, data modeling, and ML libraries.

We needed ML engineering talent, and our only options were to build, borrow, or buy that talent. Given the AI talent shortage, we relied heavily on building ML talent from within. We borrowed resources from our core DevOps team, who provisioned and maintained compute and storage resources and performed typical data center operations. Data engineers, borrowed from our data engineering team, built data ingestion pipelines. Finally, we transformed software engineers into machine learning engineers through a year-long ML educational program with hands-on training and weekly coaching sessions.

Since our engineers came with strong computer science fundamentals and experience in software engineering and system design, we could focus the curriculum on machine learning foundations such as data modeling and evaluation, basic probability and statistics, and applying machine learning algorithms to real-world problems. To help them cope with this rigorous syllabus, we provided dedicated study time on Friday afternoons.

Here is the curriculum we developed:

  • Machine Learning Algorithms — Broad introduction to ML algorithms in regression, classification, and clustering.
  • Deep Learning — Fundamentals of deep learning.
  • Natural Language Processing — Given our data is largely unstructured text, learning NLP concepts is important. The participants learn how to build cutting-edge NLP systems using word vectors, embeddings, and deep learning techniques such as recurrent neural networks and transformers.
  • ML Engineering for Production (MLOps) — Participants learn how to combine foundational concepts of machine learning with the functional expertise of software engineering to build an end-to-end ML pipeline that automates all stages of productionizing ML models. Stages include exploratory data analysis, data preparation, feature engineering, model training, tuning, evaluation, serving, deployment, monitoring, governance, and retraining (see the sketch after this list).
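For illustration only, here is a minimal scikit-learn sketch of the offline half of such a pipeline, walking through data preparation, feature engineering, training, tuning, evaluation, and persisting the artifact that serving, monitoring, and retraining would pick up. The dataset, columns, and model choice are hypothetical and far simpler than anything in our production stack.

# A minimal scikit-learn sketch of the offline training stages named above.
# The file, columns, and model are hypothetical examples, not our real stack.
import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Data preparation: a hypothetical table of job postings with click labels.
df = pd.read_csv("job_interactions.csv")  # hypothetical file
X, y = df[["job_title", "job_description"]], df["clicked"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature engineering: turn the raw text columns into TF-IDF features.
features = ColumnTransformer([
    ("title", TfidfVectorizer(max_features=5_000), "job_title"),
    ("desc", TfidfVectorizer(max_features=20_000), "job_description"),
])

pipeline = Pipeline([("features", features), ("model", LogisticRegression(max_iter=1000))])

# Tuning: a small grid search over regularization strength.
search = GridSearchCV(pipeline, {"model__C": [0.1, 1.0, 10.0]}, scoring="roc_auc", cv=3)
search.fit(X_train, y_train)

# Evaluation: report held-out AUC before promoting the model.
auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
print(f"held-out AUC: {auc:.3f}")

# Persist the artifact; serving, monitoring, and retraining pick it up from here.
joblib.dump(search.best_estimator_, "job_click_model.joblib")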

This educational program empowered our engineers by giving them a deeper understanding of the entire ML system, and it helped them build scalable tools and platforms that have transformed our ML lifecycle, making it easier, more efficient, more transparent, and better aligned with Glassdoor's unique ecosystem. The ease and speed with which we deployed our Fishbowl feed recommendation model to production is evidence of this initiative's impact.

This program also gave engineers the knowledge they need to collaborate and communicate effectively with ML Scientists, Product Managers, Product Engineers, Data Engineers, DevOps, and ML Knowledge Engineers. Today our scientists focus on algorithmic and model development while our engineers focus on scalable solutions for model deployment and monitoring in production.

In addition to the above benefits, this investment in education has helped us continue Glassdoor's tradition of investing in its employees. It has paid dividends not only in our engineers' career growth but also in accelerating Glassdoor's ML adoption, making a significant positive impact on the business. The curriculum also helps us quickly onboard new hires. Finally, and most importantly, this program has strengthened our culture of embracing and rewarding innovation, furthering Glassdoor's mission “to help people everywhere find a job and company they love.”

Acknowledgments

This post wouldn’t have been possible without the help of Glassdoor’s engineers, who went through this rigorous curriculum and transformed themselves into ML Engineers and ML Architects.

References

[1] Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., et al. (2015). Hidden technical debt in machine learning systems. NIPS 2015.

[2] Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., et al. (2014). Machine learning: The high-interest credit card of technical debt. SE4ML: Software Engineering for Machine Learning (NIPS 2014 Workshop).
