MABEL — How we build AI at Lepaya Tech

Tobias Hoelzer · Published in Lepaya Tech · Nov 2, 2022

Do you like minimizing functions? We do!

MABEL: MAchine Based ELearning.

Mabel also happens to be a lovable cartoon character from Gravity Falls, a series I like. Who came first? Hand in your guess during the interview :-)

Lepaya has disrupted the global L&D landscape with an impactful learning methodology driven by data. But now comes the next step: automated feedback on real-life scenarios, like giving feedback or holding presentations.

The MABEL AI Squad at Lepaya develops machine learning algorithms on multimodal data such as video, audio and text to help users improve their conversation skills in real-life scenarios. As the Head of AI @ Lepaya, I lead a team of software engineers and data scientists (one of them could be you) that collects videos, generates datasets, guides the annotations, builds machine learning models and deploys them to production. We work very closely with the other squads. One of them builds the Lepaya Flutter App, where the videos are collected. Other squads build internal tools and integrations (for example into Microsoft Teams).

Before a classroom session with a trainer, we give users the option to practice in a safe AI environment. Users upload practice videos to the Lepaya App, for example a rehearsal of their next public presentation. Our AI system analyzes the video and extracts the key indicators of any speech or conversation: gestures, facial expressions, eye contact, voice, word choice, etc. It then evaluates them and gives the user feedback on how it went and how to improve.

How do we do it?

Our MABEL pipeline runs a lot of machine learning models at the same time, so we need a solid process to collect data and develop the models.

  1. Upload Videos: First of all, we collect videos, internally or through our App. Our App is developed in Flutter and managed by other squads.
  2. Analyze: Then we run the videos through our MABEL API, which analyzes video, sound and text. We use Python and Poetry in Docker within SageMaker on AWS, and of course all the ML libraries you know and love: think numpy, pandas, TensorFlow, PyTorch, scikit-learn and others.
  3. Build Datasets: From all the beautiful input data, we build datasets. We use Luigi to keep track of all the transformations and make them reproducible (see the first sketch after this list).
  4. Annotate: Machine learning models need annotations, lots of them. We use Label Studio to annotate our datasets, e.g. filler words, gestures, facial expressions and the overall rating of how well you performed in your presentation.
  5. Develop ML Models: The most exciting part. With these gorgeous annotated datasets, we develop ML models. Some of these are audio models (wav2vec, filler word detection), video models (human keypoint detection, emotion classification) and regular models (to give a rating). We develop in Python and use frameworks like MLflow to keep track of experiments (see the second sketch after this list).
  6. Deploy: The scariest part. After some QA checks, we deploy the updated MABEL pipeline with our fresh new models. much model. very AI. Wow.
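
To give an idea of step 3, here is a minimal Luigi sketch (task and file names are made up for illustration, not our actual pipeline code). A task declares its output target, and Luigi only re-runs it when that target is missing, which is what keeps dataset builds cheap to repeat and reproducible.

```python
import luigi
import pandas as pd


class BuildGestureDataset(luigi.Task):
    """Hypothetical task: collect per-video gesture features into one table."""

    manifest_path = luigi.Parameter(default="data/video_manifest.csv")

    def output(self):
        # Luigi skips this task when the target already exists.
        return luigi.LocalTarget("data/gesture_dataset.csv")

    def run(self):
        manifest = pd.read_csv(self.manifest_path)
        # ... run the MABEL analysis per video and join the results here ...
        with self.output().open("w") as out:
            manifest.to_csv(out, index=False)


if __name__ == "__main__":
    luigi.build([BuildGestureDataset()], local_scheduler=True)
```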
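
And for step 5, a minimal sketch of what experiment tracking with MLflow can look like, again with made-up data and a simple scikit-learn baseline instead of our actual models:

```python
import mlflow
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Made-up stand-in for an annotated dataset: observables per video
# plus the human star rating of each video.
Z = np.random.rand(200, 4)
ratings = np.random.uniform(1, 5, size=200)
Z_train, Z_val, r_train, r_val = train_test_split(Z, ratings, random_state=0)

mlflow.set_experiment("gesture-rating")
with mlflow.start_run(run_name="ridge-baseline"):
    model = Ridge(alpha=1.0).fit(Z_train, r_train)
    val_mse = mean_squared_error(r_val, model.predict(Z_val))

    # Everything logged here shows up in the MLflow UI for comparison.
    mlflow.log_param("alpha", 1.0)
    mlflow.log_metric("val_mse", val_mse)
    mlflow.sklearn.log_model(model, "model")
```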

An example of how to detect gestures and give a rating

As an example problem, we want to detect how well you use gestures in your presentation. This is what you'll be doing in your everyday work.

We need to solve two machine learning problems. As input, we have a video.

  1. Detect observables [z] (can you guess I studied physics?). Observables are floating-point numbers derived from a video that a human could observe as well, for example the amount of time the person had their right hand in their pocket.
  2. Map these observables to a feedback a. The feedback is a rating between 1 and 5 stars. You should not have your hand in your pocket during a presentation, but you knew that, I hope?

Here is how we do it:

  1. We turn the input video into a tensor [x].
  2. We use a keypoint detection model e(x) → [y] that extracts raw features [y], for example the (x, y) coordinates of the hands in all frames of the video.
  3. We use feature engineering functions f(y) → [z] to turn those raw features [y] into observables [z]. To make sure we do this right, we use an annotated dataset.
  4. We then create a mapping g(z) → a from those observables [z] into a 5-star rating a. We also use an annotated dataset to make sure we follow a human rating.

So basically, all we need to do is minimize (g(f(e(x))) − a)² → 0
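
To make that concrete, here is a toy end-to-end sketch in numpy. The e, f and g below are made-up stand-ins (the real e is a keypoint detection model and the real g is fit on annotated human ratings, not on random numbers), but the shape of the problem, minimizing the squared error between g(f(e(x))) and the human rating a, is exactly the one above.

```python
import numpy as np


def e(video_tensor):
    """Keypoint detection e(x) -> [y]: placeholder returning random
    (x, y) hand coordinates per frame instead of a real model."""
    num_frames = video_tensor.shape[0]
    return np.random.rand(num_frames, 2)


def f(raw_features):
    """Feature engineering f(y) -> [z]: one observable, the fraction of
    frames in which the hand hangs low (a crude 'hand in pocket' proxy)."""
    hand_low = raw_features[:, 1] < 0.2
    return np.array([hand_low.mean()])


# Toy "annotated dataset": a few fake videos with human star ratings.
videos = [np.random.rand(100, 64, 64, 3) for _ in range(8)]
human_ratings = np.random.uniform(1, 5, size=8)

# Observables per video, plus a bias column.
Z = np.stack([f(e(x)) for x in videos])
X = np.c_[Z, np.ones(len(Z))]

# Least squares fits the linear mapping g that minimizes
# the sum over the dataset of (g(f(e(x))) - a)^2.
theta, *_ = np.linalg.lstsq(X, human_ratings, rcond=None)


def g(observables):
    """Mapping g(z) -> a: observables to a star rating, clamped to 1..5."""
    return float(np.clip(np.append(observables, 1.0) @ theta, 1.0, 5.0))


print("predicted rating:", g(f(e(videos[0]))))
```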

Do you like minimizing functions as much as we do?

We use Scrum to organize ourselves and we believe in honoring the maker schedule for developers. If you like working with multimodal data on an open-ended problem with real impact on humans, we are the right place for you.

If that didn't convince you, we also like to share animal facts at the end of some of our standups. Did you know that Drunk Zebrafish Convince Sober Ones to Follow Them Around?

If the answer to the question above is yes, then apply at Lepaya now and help us find the best g's, f's and e's across multimodal video, audio and text data.


Tobias Hoelzer
Lepaya Tech

Head of AI @ Lepaya / Co-Founder & CTO @ vCOACH (acquired by Lepaya)