We will look at why and how we developed an ML platform at Qucit, the challenges we faced along the way, and what can be improved in the next iteration.
Machine learning (ML for short) is an integral part of Qucit’s DNA. Indeed, Qucit’s mission is to offer “better user experience for cities using AI”. To that end, we are developing four verticals with a focus on mobility:
- FleetPredict: Improving the user experience of self-service mobility by anticipating demand
- ParkPredict: Improving the experience of motorists by anticipating their behaviors.
- RoadPredict: Anticipating incidents and upcoming traffic conditions to improve road safety and network usage for users and patrol officers
- ComfortPredict: Improving the user’s experience of shared spaces by identifying and quantifying well-being and discomfort factors.
Urban phenomena are notoriously hard to model. They happen in a dynamic setting and are multi-scale by nature. Fortunately, we have a lot of data at our disposal and thus rely heavily on machine learning models to make sense of these 4 verticals.
This reliance on machine learning means that developing models should be as seamless as possible, and that going from a prototype (usually a Jupyter notebook) to production should be as fast as possible. That wasn’t always the case.
In fact, in the early days of Qucit, developing a new ML model and putting it into production was challenging. Each new model meant writing new training pipelines, feature creation code, model deployment code, and more. This resulted in a lot of code duplication and strong coupling between data science and API development, and it made the deployment of new models a diffuse responsibility: should it be done by a data scientist, a back-end engineer, or both?
Moreover, having a different workflow for each vertical would be a nightmare to manage, deploy, and ultimately scale.
Is there a better solution?
An ML platform to the rescue
As you may have guessed, a platform is the way to go.
We started developing a suite of ML services that form the backbone of our ML platform. Together, these services make up the Urban Predictive Platform, or UPP for short.
Let’s take a step back and focus on why we needed an ML platform in the first place:
- Enable the company as a whole to rapidly iterate on the various verticals: this is a core principle that governs a lot of what we do at Qucit. More specifically, enable data scientists and product managers to build and deploy models more reliably.
- Offer scalable tools for building models: ideally, data scientists shouldn’t care about how their models will run but only focus on finding the best model for the task
- Streamline and standardize the ML workflow: a standard workflow means less work to maintain and support existing models.
- Easier monitoring: having a uniform platform makes monitoring centralized and thus easier to manage.
- Implementing and sharing best practices for building ML products: once a new feature or model works well for one vertical, it can easily be tested for other ones. An ML platform makes this task easier to achieve.
As a first measurable effect: before these ML services existed, deploying a model took us anywhere from a few weeks to a few months. Nowadays, it takes from a few hours to a few days (and we are always working on improving the process).
Now that we are convinced that an ML platform is needed, what components should we build?
Building blocks of an ML platform
Developing an ML model is much more than just calling fit and predict methods (incidentally, these are the easiest parts to implement). The figure below (from here) shows that the ML code section is the smallest one and that a whole lot of other services are required, including:
- Data collection
- Features generation
- Models training
- Models tracking
- Models deployment
- Models predictions
- Models monitoring
- And many more things
To solve most of these steps, the data science team at Qucit has developed various services and libraries over the years. Here is how the current architecture looks:
In what follows, we will explore each of these components in more detail.
ML internal libraries and services at Qucit
As mentioned above, the UPP is a set of services and internal libraries that constitute our ML platform. It consists of three main libraries (Qfeatures, Qscientist, and Qpreprocessor) and one service (Qucit Models).
Qfeatures is a feature library that contains various tools for producing:
- Calendar and solar features: day of the week, week of the year, weekends, school holidays, bank holidays, sun’s position, and many more.
- Geographical features: mainly open street map features.
- Weather features: temperature, precipitation, pressure, wind, and many more.
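To give a feel for what calendar features look like, here is a minimal sketch using only Python's standard library. It is not the Qfeatures API: the function name `calendar_features` and the output keys are assumptions made for illustration, and holidays or solar features would need extra data sources that are left out here.

```python
from datetime import datetime


def calendar_features(ts: datetime) -> dict:
    """Derive a few calendar features from a timestamp (illustrative only)."""
    _, iso_week, iso_weekday = ts.isocalendar()
    return {
        "day_of_week": iso_weekday - 1,  # Monday=0 ... Sunday=6
        "week_of_year": iso_week,        # ISO 8601 week number
        "is_weekend": iso_weekday >= 6,  # Saturday or Sunday
        "hour": ts.hour,
    }


# Hypothetical usage: features for a Saturday afternoon.
features = calendar_features(datetime(2021, 3, 6, 14, 30))
print(features)
```

Keeping each feature a pure function of the timestamp means the same code can be reused at training time and at serving time.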
Qscientist is a toolbox of utility functions that help data scientists work more quickly. For example, it contains a hyperparameter optimization pipeline. Most importantly, it contains the models’ definitions.
Qpreprocessor is a library that contains the definitions of the ML training pipelines, with one folder for each vertical. Here are some of the steps included in any ML pipeline:
- Data loading: a function for getting data, mostly from S3 (cloud object storage).
- Data processing: a function for aggregation, missing data imputation, target transformation, and other processing steps.
- Model training: importing the model from the Qscientist library and fitting it to the training data.
- Model saving: saving the model and uploading it to S3.
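The four steps above can be sketched as four small functions chained together. This is a toy version under stated assumptions, not the Qpreprocessor code: the data is hard-coded instead of being read from S3, the "model" is a plain dictionary holding a mean (standing in for a model definition imported from Qscientist), and the file is written to a local folder instead of being uploaded. All function names are hypothetical.

```python
import pickle
import statistics
import tempfile
from pathlib import Path


def load_data():
    # Assumption: the real pipeline would fetch this from S3.
    return [
        {"hour": 8, "target": 12.0},
        {"hour": 9, "target": None},  # missing observation
        {"hour": 10, "target": 18.0},
    ]


def process_data(rows):
    # Impute missing targets with the mean of the observed ones.
    observed = [r["target"] for r in rows if r["target"] is not None]
    fill = statistics.fmean(observed)
    return [
        {**r, "target": fill if r["target"] is None else r["target"]}
        for r in rows
    ]


def train_model(rows):
    # Stand-in for a Qscientist model: it only learns the mean target.
    return {"kind": "mean", "mean": statistics.fmean(r["target"] for r in rows)}


def save_model(model, directory):
    # Serialize locally; the upload to S3 is left out of this sketch.
    path = Path(directory) / "model.pkl"
    path.write_bytes(pickle.dumps(model))
    return path


def run_pipeline(output_dir):
    rows = process_data(load_data())
    model = train_model(rows)
    return save_model(model, output_dir)


with tempfile.TemporaryDirectory() as tmp:
    saved_path = run_pipeline(tmp)
    restored = pickle.loads(saved_path.read_bytes())
    print(saved_path.name, restored)
```

Because every step has the same shape across verticals, only the bodies of these functions need to change from one pipeline to the next.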
Qucit Models is our central production service. It is used for the following:
- Storing models’ binaries and associated metadata: serialized models and metadata about the associated targets, available features, and so on.
- Generating production features: using Qfeatures and real-time datasets to produce the features that are consumed in production.
- Serving predictions: an API layer that is responsible for transforming queries from the various verticals into appropriate predictions.
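As a rough illustration of how these three roles fit together, the sketch below keeps a model and its metadata side by side and uses the metadata to select the right features at prediction time. It is an in-memory toy, not the Qucit Models service: `ModelEntry`, `ModelStore`, the model name, and the stub prediction function are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ModelEntry:
    predict: Callable[[Dict[str, float]], float]  # stand-in for a deserialized model
    target: str                                   # metadata: what the model predicts
    features: List[str]                           # metadata: the inputs it expects


class ModelStore:
    """In-memory stand-in for a model storage and serving layer."""

    def __init__(self) -> None:
        self._entries: Dict[str, ModelEntry] = {}

    def register(self, name: str, entry: ModelEntry) -> None:
        self._entries[name] = entry

    def predict(self, name: str, available: Dict[str, float]) -> Dict[str, object]:
        entry = self._entries[name]
        missing = [f for f in entry.features if f not in available]
        if missing:
            raise KeyError(f"missing features for {name}: {missing}")
        # Pass only the features listed in the model's metadata.
        inputs = {f: available[f] for f in entry.features}
        return {"target": entry.target, "value": entry.predict(inputs)}


store = ModelStore()
store.register(
    "parking-demo",
    ModelEntry(
        predict=lambda x: 2 * x["temperature"] + x["is_weekend"],  # stub model
        target="occupancy",
        features=["temperature", "is_weekend"],
    ),
)
response = store.predict(
    "parking-demo", {"temperature": 20.0, "is_weekend": 1.0, "unused": 3.0}
)
print(response)
```

Storing the expected feature list next to the model lets the serving layer fail early when a feature is missing, rather than producing a silently wrong prediction.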
A use case
To make things more concrete, let’s walk through a typical ML workflow using the UPP from model prototyping to production.
Let’s say that a data scientist working on ParkPredict has a new idea to improve the existing model using a newly engineered feature.
First things first, she will start prototyping on her laptop or using our dedicated data science server if she needs more resources. Once the prototyped model shows promising results offline, it is time to make it production-ready.
Here is how it is done:
- A new training pipeline is added to the Qpreprocessor library: this pipeline includes data loading (from S3), data processing, hyperparameter optimization, and model serialization.
- Once the model is trained, it is uploaded to S3 for later use (and to keep track of all the models).
- Then, the model is deployed to Qucit Models using a CLI (first to a staging instance, then to production).
- The new model is now available and can be requested by the ParkPredict API
That’s it, as simple as that. That being said, there are some limitations to the current workflow.
Improved ML services
One of the current limitations concerns feature creation and usage. Following the use case, let’s imagine that the newly found feature isn’t yet available in Qfeatures. What should be done then?
For now, the features are produced either within Qpreprocessor’s pipeline definition during the training phase or within the Qucit Models service during serving (both using the Qfeatures library).
Thus, adding a freshly minted feature requires three things: adding it to Qpreprocessor, Qfeatures, and Qucit Models.
This approach works (to some extent) for now but isn’t scalable and creates a lot of friction.
It would be much better to have a centralized service for managing features, which would ease the effort of developing new ones. Here is how the new service will handle the training and prediction parts:
- Training features: these will be computed in batches and stored on S3. Once a set of features is computed, it can be re-used by data scientists for other models.
- Prediction features: these will be computed more frequently (let’s say once every 15 minutes) and will be available for computing predictions. Only the latest features will be stored in the production database.
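The split between the two kinds of features can be sketched with a small in-memory class. This is only an illustration of the idea: the class `FeatureStore`, its method names, and the `station-1` identifier are hypothetical, a list stands in for S3, and a dictionary stands in for the production database.

```python
from datetime import datetime
from typing import Dict, List, Tuple

Values = Dict[str, float]


class FeatureStore:
    """Toy feature store: full history for training, latest values for serving."""

    def __init__(self) -> None:
        self._batch: List[Tuple[datetime, str, Values]] = []   # stand-in for S3
        self._online: Dict[str, Tuple[datetime, Values]] = {}  # stand-in for the DB

    def write_batch(self, rows: List[Tuple[datetime, str, Values]]) -> None:
        # Training features: append everything so it can be re-used later.
        self._batch.extend(rows)

    def training_rows(self, entity_id: str) -> List[Values]:
        ordered = sorted(self._batch, key=lambda row: row[0])
        return [values for _, eid, values in ordered if eid == entity_id]

    def write_online(self, entity_id: str, computed_at: datetime, values: Values) -> None:
        # Prediction features: keep only the most recent computation.
        current = self._online.get(entity_id)
        if current is None or computed_at >= current[0]:
            self._online[entity_id] = (computed_at, dict(values))

    def latest(self, entity_id: str) -> Values:
        return self._online[entity_id][1]


fs = FeatureStore()
fs.write_batch([
    (datetime(2021, 1, 1, 9, 0), "station-1", {"temperature": 3.0}),
    (datetime(2021, 1, 1, 8, 0), "station-1", {"temperature": 2.0}),
])
fs.write_online("station-1", datetime(2021, 1, 1, 10, 15), {"temperature": 5.0})
fs.write_online("station-1", datetime(2021, 1, 1, 10, 0), {"temperature": 4.0})  # late arrival, ignored
print(fs.training_rows("station-1"), fs.latest("station-1"))
```

Comparing timestamps before overwriting protects the serving side against a late or replayed computation replacing fresher values.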
For those familiar with data engineering concepts, a lambda architecture could be a suitable choice for this component.
Finally, note that there should also be a library for easily defining new features, either from scratch or from existing ones. It should also act as a wrapper around S3 for easier access.
Models’ hyperparameters optimization and retraining
For now, the models’ training, hyperparameter optimization, and retraining are all done manually. As mentioned in the use case above, these steps are done by the data scientist in charge of the model.
This process is time-consuming and laborious, which increases the risk of skipping parts of it altogether: for example, forgetting to tune hyperparameters or to retrain models at the desired frequency.
A better approach would be to define a retrainable pipeline, schedule it on a retraining service, and let the service handle the process. Note that there would be two scheduling frequencies: one for hyperparameter optimization, which would happen only a few times (once at the start and maybe once a year), and one for retraining with new data, which would happen more often (once a month, for example).
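The two frequencies can be expressed as a small decision function that a scheduler would call periodically. This is a sketch under assumptions: the intervals (30 and 365 days) mirror the examples given above, the function and job names are hypothetical, and we assume that a new hyperparameter search is always followed by a retraining.

```python
from datetime import date, timedelta
from typing import List, Optional

RETRAIN_EVERY = timedelta(days=30)  # assumption: roughly once a month
TUNE_EVERY = timedelta(days=365)    # assumption: roughly once a year


def due_jobs(
    today: date,
    last_retrained: Optional[date],
    last_tuned: Optional[date],
) -> List[str]:
    """Return the jobs that should run today, in execution order."""
    if last_tuned is None or today - last_tuned >= TUNE_EVERY:
        # New hyperparameters only take effect once the model is refitted.
        return ["tune_hyperparameters", "retrain"]
    if last_retrained is None or today - last_retrained >= RETRAIN_EVERY:
        return ["retrain"]
    return []


today = date(2021, 6, 1)
print(due_jobs(today, last_retrained=None, last_tuned=None))
print(due_jobs(today, last_retrained=date(2021, 4, 1), last_tuned=date(2021, 1, 1)))
print(due_jobs(today, last_retrained=date(2021, 5, 20), last_tuned=date(2021, 1, 1)))
```

Keeping the decision separate from the execution makes the schedule easy to test without training anything.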
An important part of developing new ML models is prototyping (ideally trying as many models as possible until finding a good one). That’s a challenging task for many reasons. One of these is the difficulty of tracking the various iterations: which features were used, when the run started and ended, which hyperparameters were chosen, and so on.
That’s why we are working on a tracking solution that will store models’ metadata (creation date, model creator, model location on S3, environment packages, and so on) automatically each time a new ML pipeline is run. This tracking service will make model deployment easier since it will synchronize the needed metadata to Qucit Models.
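One way to record metadata automatically on every run is to wrap the pipeline function in a decorator. The sketch below illustrates that idea only: the `tracked` decorator, the `TRACKED_RUNS` list (a stand-in for the tracking service), the example pipeline, and the bucket path are all hypothetical, and the Python version stands in for a full list of environment packages.

```python
import functools
import platform
from datetime import datetime, timezone

TRACKED_RUNS = []  # stand-in for the tracking service's storage


def tracked(creator):
    """Record metadata each time the decorated pipeline is run."""

    def decorator(pipeline):
        @functools.wraps(pipeline)
        def wrapper(**params):
            started_at = datetime.now(timezone.utc)
            model_location = pipeline(**params)
            TRACKED_RUNS.append({
                "pipeline": pipeline.__name__,
                "creator": creator,
                "created_at": started_at.isoformat(),
                "model_location": model_location,
                "python_version": platform.python_version(),
                "params": dict(params),
            })
            return model_location

        return wrapper

    return decorator


@tracked(creator="data-scientist@example.com")
def train_parking_model(max_depth=3):
    # Hypothetical pipeline: returns where the trained model was stored.
    return f"s3://example-bucket/models/parking-depth{max_depth}.pkl"


location = train_parking_model(max_depth=5)
print(TRACKED_RUNS[-1])
```

Because the record is written by the wrapper rather than by the data scientist, it cannot be forgotten, and it captures only the parameters that were passed explicitly.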
With these new elements in mind, here is how the next version of the UPP will look:
For the curious among you, notice that there will be more technical details about each component in later blog posts, so stay tuned.
This blog post was a high-level overview of Qucit’s ML platform: its components, how it works, its current limitations, and how to improve things in future iterations.
Keep in mind that this is just a glimpse of the data ecosystem at Qucit.
Indeed, in addition to the UPP, we have spent the last months developing an Urban Analytics Platform (UAP for short) that provides unified analytics tools (time series, histograms, bar plots and so on) for the various verticals. These same tools are also used for monitoring the predictions of the underlying models. You should expect a blog post about the UAP in the upcoming months, so stay tuned.
Finally, if you find this kind of work interesting, Qucit is always looking for talented data scientists and/or data engineers.
To go beyond
Here is a list of additional resources to learn more about other ML platforms (for urban phenomena):
About the author
Yassine is a data scientist at Qucit. Among his missions, he ensures that the various data science models integrate seamlessly into Qucit’s various products, APIs and internal tools. In addition to these functions, Yassine is actively involved in sharing knowledge internally and onboarding new data scientists.
Yassine holds a degree in general engineering from Ecole Centrale Paris and a Master’s degree in statistics from Cambridge University.
Before joining Qucit, Yassine worked at Suez Environment on the detection of anomalies in the drinking water network. Just before that, he conducted econometric consulting missions.
Outside of work, Yassine enjoys testing new data science and visualization tools, participating in Kaggle challenges, and answering questions on Quora. He also likes running and watching movies with plot twists.