Empowering Mobile Apps with On-Device Artificial Intelligence

Explore the on-device AI state of the art and learn how to leverage the power of ML Kit for Pose Detection on Android and iOS.

Gonçalo Martins
Pink Wall
8 min read · Sep 28, 2023


Over the last few years, we have witnessed a huge increase in the development and use of artificial intelligence. Because of the high complexity of the existing algorithms, the computing power they need is also high and costly, which is why they usually run on the backend.
As mobile developers with an interest in the field, we wanted to understand what could be done at the device level, without the need for an external server or service.

In this blog post, I will briefly discuss some of the available tools for on-device AI development, as well as provide an overview of our first AI iOS app: a push-up counter!

On-Device Artificial Intelligence

On-device AI has many advantages over using a server. To name a few:

  • Low latency — since there is no networking, there is no latency associated with requests;
  • Works offline — no networking also means the app does not need an internet connection to work;
  • Privacy — all the data is processed on the device, so everything stays on the device and nothing is sent to external servers;
  • No extra costs — only the device is used to process data, so there are no server costs.

There are, however, drawbacks to this approach. The main issues relate to the limitations of the devices themselves: memory, processing power, and battery. Consequently, we may need to use smaller models, which translates into less capable ones.

It is therefore important to consider the specific use case to make a proper decision about what to use. Let’s take a closer look at some of the available tools for on-device AI development.

ML Kit

ML Kit is Google’s turnkey solution for on-device artificial intelligence development on both Android and iOS. It provides easy-to-use APIs for various AI problems in the fields of Vision and Natural Language. The algorithms use Google’s pre-trained models, but some of the APIs also let you plug in your own custom TensorFlow Lite models if you prefer.
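To give a feel for the API style, here is a minimal sketch using the on-device language identification API (one of the Natural Language APIs mentioned above), assuming the GoogleMLKit/LanguageID pod is installed:

```swift
import MLKitLanguageID

// Identify the language of a string entirely on the device.
let languageId = LanguageIdentification.languageIdentification()

languageId.identifyLanguage(for: "Bom dia!") { languageCode, error in
    if let error = error {
        print("Failed to identify language: \(error)")
        return
    }
    // "und" means the language could not be determined.
    if let languageCode = languageCode, languageCode != "und" {
        print("Identified language: \(languageCode)")
    }
}
```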

Core ML

Apple has developed an AI framework of its own: Core ML. It supports the Vision, Natural Language, Speech, and Sound Analysis frameworks, which can be used to solve the corresponding ML problems. Alternatively, you can use your own models, which should either be converted to the appropriate format using Core ML Tools or created directly with Create ML. The disadvantage of Core ML is that it only works on Apple platforms.
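As a quick illustration of the programming model, here is a minimal sketch of running an image classifier through Core ML via the Vision framework. PushupClassifier is a hypothetical class that Xcode would generate from a bundled PushupClassifier.mlmodel file; it stands in for any real model:

```swift
import CoreML
import Vision

// PushupClassifier is a hypothetical generated model class; Xcode creates
// it automatically from a bundled PushupClassifier.mlmodel file.
func classify(_ cgImage: CGImage) throws {
    let coreMLModel = try PushupClassifier(configuration: MLModelConfiguration()).model
    let visionModel = try VNCoreMLModel(for: coreMLModel)

    let request = VNCoreMLRequest(model: visionModel) { request, _ in
        guard let results = request.results as? [VNClassificationObservation],
              let top = results.first else { return }
        print("\(top.identifier): \(top.confidence)")
    }

    // Vision handles scaling the image to the model's expected input size.
    try VNImageRequestHandler(cgImage: cgImage, options: [:]).perform([request])
}
```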

TensorFlow Lite

TensorFlow is a widely used machine learning framework, and TensorFlow Lite is its solution for on-device AI on Android and iOS. It allows you to run models to make inferences on new data. TensorFlow Lite models are smaller and faster at inference; they can be obtained by converting an existing TensorFlow model to this format or by creating a new one with the TensorFlow Lite Model Maker tool. If you don’t want to create your own models, TensorFlow Lite also provides pre-trained ones for common ML problems.
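For a rough idea of what inference looks like with the TensorFlowLiteSwift pod, here is a minimal sketch; the model file name and the flat float input are placeholders for a real model and its expected input shape:

```swift
import TensorFlowLite

// Run a bundled TensorFlow Lite model on a flat float input and return the
// raw bytes of the first output tensor.
func runInference(input: [Float]) throws -> Data {
    guard let modelPath = Bundle.main.path(forResource: "model", ofType: "tflite") else {
        fatalError("model.tflite not found in the app bundle")
    }
    let interpreter = try Interpreter(modelPath: modelPath)
    try interpreter.allocateTensors()

    // Copy the input bytes into the first input tensor.
    let inputData = input.withUnsafeBufferPointer { Data(buffer: $0) }
    try interpreter.copy(inputData, toInputAt: 0)

    try interpreter.invoke()
    return try interpreter.output(at: 0).data
}
```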

PyTorch Mobile

Much like TensorFlow, PyTorch is a very popular framework for machine learning development. PyTorch Mobile, its Android and iOS framework, allows you to deploy models and provides APIs for preprocessing data. You can also optimize existing PyTorch models for mobile use. At the time of writing, however, PyTorch Mobile is still in beta.

Why ML Kit?

We ultimately decided to go with ML Kit to implement our app. It comes with an easy-to-use pose detection algorithm out of the box, and it was also important to us that it is available for both iOS and Android. Even though we only want to build an iOS app for now, we would rather gain familiarity with an SDK that also works on Android, in case we want to use it there in the future.

The Push-up Counter App

The push-up counter app is a simple one-minute challenge: users do as many push-ups as possible. It is based on Google’s push-up counting app for Android using ML Kit. From a high-level perspective, the process is fairly simple: the app uses the phone’s camera to film the user doing push-ups, runs pose estimation on the frames, processes the estimated pose, classifies it as either up or down, and increments the counter when appropriate. Looked at in detail, though, each of these steps can be quite complex. Let’s take a closer look.
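The final, counting step is actually simple enough to sketch right away. This is a minimal, self-contained illustration using our own hypothetical naming, not code from Google’s sample: a push-up is counted on every down-to-up transition of the smoothed classification.

```swift
enum PoseClass { case up, down }

// Counts a push-up on every down -> up transition of the classification
// stream. PoseClass and this class are illustrative names, not ML Kit API.
final class PushupCounter {
    private(set) var count = 0
    private var lastClass: PoseClass?

    func update(with newClass: PoseClass) {
        if lastClass == .down && newClass == .up {
            count += 1
        }
        lastClass = newClass
    }
}
```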


Pose Detection

Since our goal is to count push-ups, we need to know what the user is doing, namely the position of their body, at every moment during the challenge. Google’s ML Kit offers an off-the-shelf solution for detecting poses in images and video. A pose corresponds to the position of the body and is composed of a set of 33 landmarks. These represent key skeletal points and can be used to determine whether two poses are the same. Each landmark is described by a set of 3D coordinates, where the z coordinate corresponds to the depth of the landmark relative to the user’s hips.

There are two available SDKs for pose detection: a base one and an accurate one. The first is faster, but the detected poses are less accurate. In our case, since we are processing video frames and performance is a concern, we use the base one. If the quality of the pose matters more and speed is less relevant, consider using the accurate SDK.
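Setting up the base detector in stream mode, which is what we use for video frames, looks roughly like this; it follows ML Kit’s iOS quickstart pattern, and detectPose is our own illustrative wrapper:

```swift
import MLKitPoseDetection
import MLKitVision
import UIKit

// Base (fast) pose detector configured for a stream of video frames. For
// the accurate SDK you would import MLKitPoseDetectionAccurate and use
// AccuratePoseDetectorOptions instead.
let options = PoseDetectorOptions()
options.detectorMode = .stream
let poseDetector = PoseDetector.poseDetector(options: options)

func detectPose(in image: UIImage, completion: @escaping (Pose?) -> Void) {
    let visionImage = VisionImage(image: image)
    poseDetector.process(visionImage) { poses, error in
        guard error == nil, let pose = poses?.first else {
            completion(nil)
            return
        }
        // Each of the 33 landmarks carries a 3D position, for example:
        let leftShoulder = pose.landmark(ofType: .leftShoulder)
        print(leftShoulder.position.x, leftShoulder.position.y, leftShoulder.position.z)
        completion(pose)
    }
}
```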

Note: Pose detection will only work if the user’s face is present in the image, but it works for partial body images as well.

Classification

In order to count push-ups, we need to be able to determine the class of each pose, that is, the category to which it belongs: up or down. To that end, we use a k-nearest neighbors (KNN) algorithm. Given a new sample, it determines the k closest examples in the training data, its nearest neighbors, using a given distance metric, and then assigns the new sample the majority class among these neighbors. KNN is a non-parametric, supervised classifier, meaning, respectively, that it makes no assumptions about the data and that it uses labeled training data. It is simple yet powerful, intuitive, and easy to implement, and it requires no training step (it is often called a lazy learner). On the other hand, it is sensitive to outliers and becomes computationally demanding and slow on large datasets.

Our classifier uses a value of k = 10 and a weighted Euclidean distance as the distance metric, since the value of z is not as relevant in this case.
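A self-contained sketch of this classification step follows. The type names and the exact z weight are our own illustration; only k = 10 and the down-weighted z come from the setup described above.

```swift
enum PoseClass { case up, down }

struct Sample {
    let embedding: [SIMD3<Float>] // pose embedding as 3D vectors
    let label: PoseClass
}

let zWeight: Float = 0.2 // assumption: z is less reliable than x and y

// Euclidean distance with the z component down-weighted.
func weightedDistance(_ a: [SIMD3<Float>], _ b: [SIMD3<Float>]) -> Float {
    zip(a, b).reduce(0) { acc, pair in
        let d = pair.0 - pair.1
        return acc + d.x * d.x + d.y * d.y + zWeight * d.z * d.z
    }.squareRoot()
}

// Majority vote among the k nearest training samples.
func classify(_ embedding: [SIMD3<Float>],
              training: [Sample],
              k: Int = 10) -> PoseClass {
    let nearest = training
        .map { (dist: weightedDistance($0.embedding, embedding), label: $0.label) }
        .sorted { $0.dist < $1.dist }
        .prefix(k)
    let ups = nearest.filter { $0.label == .up }.count
    return ups * 2 > nearest.count ? .up : .down
}
```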

Additional Steps

Simply running the pose detection algorithm and feeding its output to the classifier is not enough to achieve good results. We also needed to implement some additional processing steps, described as follows:

  1. Normalization
    KNN classifiers require normalized data, to remove the influence of scale on their accuracy. In this app it is also easy to see that the way the video is shot matters: since landmarks are described as sets of coordinates, a slight change in zoom or in the person’s position in the frame changes their values. Normalization fixes both issues. First, to normalize translation, we calculate the center of the hips and subtract it from all the landmarks. Next, we calculate the size of the pose, looking only at the x and y values. We compare the size of the body, defined as the distance between the center of the hips and the center of the shoulders multiplied by a pre-defined value, with the distance between the center of the hips and each landmark; the size of the pose is the maximum of all these values. Finally, all landmarks are divided by the size of the pose (see the first sketch after this list).
  2. Pose Embedding
    Instead of using all of the available landmarks for classification, we use a set of calculated distances between landmarks to form the pose embedding. This allows us to drop unnecessary landmarks, such as facial features, and also reduces the number of dimensions.
  3. Outlier Filtering
    As previously stated, KNN classifiers are prone to outliers. Consider a pose that is very close to another one, with only a couple of differing landmarks: the distance between them can be quite low even though the poses belong to different classes. To filter the outliers, we first select the closest samples using a weighted maximum-distance KNN algorithm and use those samples as the training data for the actual KNN classifier.
  4. EMA Smoothing
    Another problem we faced was push-ups being counted due to incorrect classifications, causing a big jump in the total. This happened quite often, as any up-down-up sequence of classifications counts as a push-up and increments the counter. To mitigate this, we implemented exponential moving average (EMA) smoothing. We keep track of the last 10 classifications at all times and take them into account, together with the KNN classifier’s output, to get the final classification, with decreasingly lower weights for older classifications. Let’s see how this works in practice. Imagine the window is full of up values and the next pose is classified as down. Due to the influence of the values in the window, the final output is still up, but the down is added to the window nonetheless. If it was a misclassification and the next value is up, we have prevented an incorrect count. If the following poses keep being classified as down, the final classification eventually flips to down as the window keeps getting updated (see the second sketch after this list).
    The window can also be reset when a new frame is processed, based on the timestamp of the last processed frame, to avoid incorrect influence from older classifications.
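To make steps 1 and 4 concrete, here are two self-contained sketches. First, the normalization step; the body-size multiplier value is an assumption, and in practice the ML Kit landmark positions would feed these arrays:

```swift
import simd

// Step 1 (normalization): move the hips center to the origin, then divide
// every landmark by the pose size. The multiplier value is an assumption.
let bodySizeMultiplier: Float = 2.5

func normalize(_ landmarks: [SIMD3<Float>],
               hipsCenter: SIMD3<Float>,
               shouldersCenter: SIMD3<Float>) -> [SIMD3<Float>] {
    // Normalize translation: subtract the hips center from every landmark.
    let translated = landmarks.map { $0 - hipsCenter }
    let shoulders = shouldersCenter - hipsCenter

    // Pose size: the maximum of the scaled body size (hips-to-shoulders
    // distance times the multiplier) and the farthest landmark from the
    // hips center, using only x and y.
    let bodySize = simd_length(SIMD2(shoulders.x, shoulders.y)) * bodySizeMultiplier
    let maxReach = translated.map { simd_length(SIMD2($0.x, $0.y)) }.max() ?? 1
    let poseSize = max(bodySize, maxReach)

    // Normalize scale: divide every landmark by the pose size.
    return translated.map { $0 / poseSize }
}
```

Second, the EMA smoothing step; the decay factor is an assumption:

```swift
// Step 4 (EMA smoothing): exponentially weighted voting over a window of
// the last 10 classifications. PoseClass is redefined here so the sketch
// is self-contained; alpha is an assumed decay factor.
enum PoseClass { case up, down }

final class EMASmoother {
    private var window: [PoseClass] = []
    private let maxSize = 10
    private let alpha: Float = 0.7

    func add(_ newClass: PoseClass) -> PoseClass {
        window.append(newClass)
        if window.count > maxSize { window.removeFirst() }

        // The newest classification gets weight 1; every step back in time
        // multiplies the weight by alpha, so older votes fade out.
        var upScore: Float = 0
        var downScore: Float = 0
        var weight: Float = 1
        for cls in window.reversed() {
            if cls == .up { upScore += weight } else { downScore += weight }
            weight *= alpha
        }
        return upScore >= downScore ? .up : .down
    }

    // Reset based on frame timestamps to drop stale classifications.
    func reset() { window.removeAll() }
}
```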

Final Thoughts

There are already some powerful tools available for on-device AI development, and we were able to implement a push-up counter app with good results overall. However, before deciding to start building on-device AI apps, consider everything we discussed in this blog post: it is important to take the current limitations into account and to evaluate each scenario properly to determine whether the performance tradeoff is worth it.

With AI still on the rise, it is only natural that more and more use cases and tools will continue to emerge. It will be interesting to keep following the trends and see what the future holds for us mobile developers!

If you want to know more about this topic and other topics related to mobile development check out our website, www.pinkroom.dev, and follow us on LinkedIn, Twitter, and Instagram. Let’s connect!
