Sensor Fusion with Language Models

Representing signals from different sensors in natural language can enable Language Models to ingest information efficiently

Herscovici Robert
3 min read · Jan 21, 2023

In this article you will find:

  1. Introduction to Sensor Fusion
  2. A different perspective on Language Models
  3. How to fuse data from different sensors in a Language Model (you can skip to this section if you already know the first parts)
Image generated with DALL-E

Introduction to Sensor Fusion

Sensor fusion is the process of combining signals from multiple sources in order to build a more accurate representation of the environment. A simple example is how self-driving cars combine data from cameras and radar to detect objects around them.

A different perspective on Language Models

Language Models are (large) neural networks trained on massive datasets of text, and in the process they develop an abstract understanding of natural language.

Here’s how I like to think about it:

Imagine an alien who knows nothing about our world — or even what a world is — but has found a way to intercept all our text data. No images, no sound, nothing. Through a combination of memorization and efficient knowledge representation, it would form a general idea of what’s going on on our planet.

This kind of general understanding is incredibly useful for fusing different sensors, and the reason is quite simple. When we look at a 2D map, we can easily navigate the real world with it, because our general understanding tells us what each sensor represents: we know that a map is a flat drawing of the real world seen from above, that it usually contains roads, and so on.

A randomly initialized neural network, on the other hand, knows none of this and will definitely struggle to navigate. This is just one example of why a model equipped with a general understanding can be extremely valuable.

How to fuse data from different sensors in a Language Model

To make use of the general knowledge already present in these Language Models, we have to represent the data from different sensors in a form a language model can understand: natural language.

Let’s take a simple example: we want to build a smart keyboard app for mobile. Our goal is to predict the next word the user will type.

Input: “Hi, just finished work. I think I’ll take a ”

Possible Outputs: “walk”, “bus”, etc.

Both words come out with similar, fairly high probabilities, so the model doesn’t have enough information to choose confidently between them.
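
To see this concretely, here is a minimal sketch of how you might inspect next-word probabilities, assuming GPT-2 loaded through the Hugging Face transformers library; next_word_probability is a hypothetical helper, not a library function:

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    def next_word_probability(prompt: str, word: str) -> float:
        """Probability the model assigns to `word` as the next word after `prompt`."""
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)
        next_token_probs = torch.softmax(logits[0, -1], dim=-1)
        # GPT-2 treats the leading space as part of the word, so score " walk",
        # not "walk"; for multi-token words we score only the first token.
        first_token_id = tokenizer.encode(" " + word)[0]
        return next_token_probs[first_token_id].item()

    prompt = "Hi, just finished work. I think I'll take a"
    for word in ["walk", "bus"]:
        print(f"{word}: {next_word_probability(prompt, word):.4f}")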

Now, let’s say we also have access to live weather data from the user’s location. With that extra signal in mind, “walk” is clearly more probable when the weather is sunny, and “bus” more likely when it’s raining.

Great, but how can we feed this information to the model? Of course, we could encode the weather as a one-hot vector, but then we would need to integrate that vector into the model’s architecture, which might require some serious engineering.

Another way is to just include it in our input:

Input: “It’s raining outside. Hi, just finished work. I think I’ll take a ”

Even without retraining the model, the probability of “walk” decreases considerably, while the probability of “bus” increases.
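
Reusing the hypothetical next_word_probability helper from the sketch above, we can compare the two prompts directly:

    base = "Hi, just finished work. I think I'll take a"
    rainy = "It's raining outside. " + base

    # Compare how the weather sentence shifts each candidate word's probability.
    for word in ["walk", "bus"]:
        print(f"{word}: {next_word_probability(base, word):.4f}"
              f" -> {next_word_probability(rainy, word):.4f}")

The exact numbers will vary by model, but the relative shift between “walk” and “bus” is what matters.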

This was just a simple example, but you can apply the same trick in almost any use case where a natural-language representation of the sensor data makes sense.

If you enjoyed this article, follow me for similar content.
