Lessons Learned Working on Real-World ML Pipelines

Carlo Provinciali
Latch Engineering Blog
5 min read · Apr 6, 2021
Good models on their own aren’t enough to ensure the success of ML projects

At Latch, we work at the intersection of hardware and software, which creates interesting challenges when working with data collected from our devices. Because hardware and software components are closely interconnected, a small change in firmware might change how a certain data field is stored, which creates a downstream effect on any analytics or ML applications.

While implementing one of our first ML pipelines in production, we overcame a number of challenges during the initial data exploration, modeling, and productionization.

Background

Since Latch started manufacturing locks five years ago, we have gone through multiple iterations of hardware and firmware. As a result, we have a wide range of smart locks installed in buildings across the country. These devices are, for the most part, powered by AA batteries, which enable all the “smart” functionality, such as unlocking one’s apartment with a smartphone. It is therefore important to ensure that batteries are replaced promptly, before a device runs out of power.

The challenge is that most locks are not connected to the internet (they only exchange information with our servers when unlocked via phone), so it is hard to know the battery level of each device in real time. In addition, devices that are used more frequently drain their batteries faster, while less frequently used devices last longer, so there is no accurate rule of thumb for when each device might need its batteries replaced.

For these reasons, we decided to create a model to predict days left until each device will need its batteries replaced, and we learned quite a bit along the way.

Lesson #1 — When handling an unfamiliar dataset, ask the experts for advice.

When we started to visualize the battery usage curves for a few devices at a time, we noticed something odd. For some locks, the battery percentage decreased dramatically — by 10–20% — in a short period of time, and then recovered back to the previous level.

Original battery percentage level (blue) and “corrected” battery percentage (orange)

At first, we were pretty puzzled. Should we simply disregard these episodes as faulty readings? Is this a problem with our data, and if so, how serious is it?

After ruling out issues with our data processing, we reached out to our in-house experts, the hardware team. They quickly pointed out that the odd behavior in the battery data reflected an issue with battery measurements in a previous firmware version, which had been corrected in a subsequent release. In short, the battery voltage was being measured while the lock motor was activated, which led to a lower reading than you’d expect if the motor were idle. Together with the hardware team, we devised a strategy to “correct” the faulty readings and approximate the “true” battery level, which significantly improved the quality of our training data.
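As an illustration of the kind of correction this enables, the sketch below leans on the fact that a real battery level can only decrease over time: any reading that sits below a later reading must be understated (for example, because it was taken while the motor was under load). This is a minimal sketch in pandas under those assumptions, not the exact logic we shipped, and the column name is made up for the example.

```python
import pandas as pd

def correct_motor_dips(device_readings: pd.DataFrame) -> pd.Series:
    """Approximate the "true" battery level for a single device.

    Assumes `device_readings` is sorted by timestamp and contains a
    `battery_pct` column; the name and the logic are illustrative only.
    """
    pct = device_readings["battery_pct"]
    # Reverse the series, take the running maximum, then reverse back:
    # every dip that later "recovers" is lifted to the level it recovers to,
    # while genuine (permanent) drops are left untouched.
    return pct[::-1].cummax()[::-1]

# Example usage:
# device_readings["battery_pct_corrected"] = correct_motor_dips(device_readings)
```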

We realized that without asking the opinion of our domain experts, we might have picked another option, such as averaging the readings or removing the observations altogether, which could have added bias to our dataset.

Lesson #2 — ML performance metrics don’t always convey business value

After a first attempt at cleaning the dataset and creating features, we started training different models and sharing results with the different folks involved in this project. We decided to use the root mean squared error (RMSE) as the metric to quantify the accuracy of the model predictions. If you have dabbled with ML before, this term will probably sound familiar, but we realized that it wasn’t intuitive enough for folks outside of our team. Furthermore, the metric itself, which measures how “far off” each prediction is from the true value, with additional emphasis on predictions that are way off, wasn’t doing a great job of answering the fundamental question: “Will this help reduce cases where locks run out of batteries before the building staff can replace them?”

So we spent a little bit of time looking through our validation set and trying to understand how the model predictions translated into specific user scenarios. How often were we underestimating the remaining battery life? If the building staff took a week to replace the batteries, what were the chances of them getting fully depleted?

What emerged is that the risk of overestimating the remaining battery life was far greater than that of underestimating it. In the first scenario, we risked giving property staff a false sense of confidence that their devices’ batteries would last a few days longer when, in reality, they should be replaced right away, which could lead to a worse resident experience. Diving into our results beyond simply reporting the model’s performance score helped us understand the business implications of each solution and convey the value to a broader audience.
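To make that concrete, the snippet below contrasts RMSE with two questions that map more directly onto the scenario above: how often the model overestimates the remaining battery life, and how often a device that we said could wait at least a week would actually run out within that week. The function and variable names, and the one-week lead time, are assumptions for illustration rather than our production evaluation code.

```python
import numpy as np

def evaluate_predictions(y_true_days: np.ndarray, y_pred_days: np.ndarray,
                         lead_time_days: int = 7) -> dict:
    """Compare RMSE with business-oriented error rates (illustrative only)."""
    errors = y_pred_days - y_true_days
    rmse = float(np.sqrt(np.mean(errors ** 2)))

    # How often do we promise more battery life than the device really has?
    overestimate_rate = float(np.mean(errors > 0))

    # Of the devices we predicted could wait at least `lead_time_days`,
    # how many would actually run out of battery within that window?
    deferred = y_pred_days >= lead_time_days
    depleted_while_waiting = (
        float(np.mean(y_true_days[deferred] < lead_time_days)) if deferred.any() else 0.0
    )

    return {
        "rmse": rmse,
        "overestimate_rate": overestimate_rate,
        "depleted_while_waiting_rate": depleted_while_waiting,
    }
```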

Lesson #3 — The only constant is change (in the data)

“Nothing is worse than having to babysit AI”

— me, while thinking of a quote to add to this piece

Finally, after a number of iterations and feedback rounds, we had our prediction pipeline ready to go. Things were going smoothly… except for when they weren’t. It turns out that when you are automating a lot of transformations on never-before-seen data, a lot can go wrong. A small change in the schema or a data type could result in your carefully crafted ML pipeline spitting out nonsense, if not grinding to a halt entirely. On top of that, figuring out what went wrong is tedious, as it might involve scanning logs for clues or querying a number of SQL tables.

We quickly realized that in order to stay sane, we had to have visibility into every step of the pipeline, which meant checking the data coming in and out of each component. So we invested time in building a component that scans and validates data at every step of the process. Luckily, we stumbled across an amazing tool called great_expectations that made the task a lot easier. By ensuring that data looked a certain way as it came in and out of the prediction pipeline, we could not only prevent issues caused by schema changes or bad data, but also detect when recent data had drifted far enough from what we had collected historically that the model needed retraining. This proved to be a huge advantage: we no longer had to closely supervise the model inputs and outputs, and the time to identify root causes when something went wrong dropped dramatically, which freed up significant time to work on other initiatives.
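For reference, a minimal check with great_expectations (using its pandas interface) might look like the sketch below. The column names, bounds, and failure handling are illustrative assumptions, and the real pipeline runs validations like these at several points, not just once.

```python
import great_expectations as ge

def validate_battery_readings(readings_df):
    """Run a few basic expectations against a batch of incoming readings.

    The columns and bounds here are illustrative, not our production suite.
    """
    batch = ge.from_pandas(readings_df)
    batch.expect_column_values_to_not_be_null("device_id")
    batch.expect_column_values_to_be_between("battery_pct", min_value=0, max_value=100)
    batch.expect_column_values_to_be_of_type("battery_pct", "float64")

    result = batch.validate()
    if not result.success:
        # Fail fast instead of letting bad data flow into the model.
        raise ValueError(f"Data validation failed: {result}")
    return result
```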

Next steps

Of course, our learning journey doesn’t end here, and we are currently exploring a number of different tools and approaches that can help us build machine learning applications quickly and reliably. For instance, we are looking into ways to automate the selection of the best-performing model for production and into leveraging serverless frameworks to support near-real-time predictions.

The lessons we learn along the way will enable us to leverage our ML capabilities to solve business problems across a number of different domains, ranging from supply chain and manufacturing to creating exciting new features for our products.
