Jan Erik, a serial computer-vision entrepreneur

Jan Erik Solem, Founder + CEO Mapillary
“A pioneer and mathematician at heart, Jan Erik is a key influencer in computer vision in Europe. He started in the field long before it was fashionable and founded facial recognition start-up Polar Rose in the early 2000s, which he later sold to Apple. He then spent several years leading a computer vision team there. When Jan Erik and I met, he was already working on his next big idea, Mapillary, which is revolutionizing the way we build maps through collaboration. It continues to be a remarkable journey to work with him and the Mapillary team to make a meaningful difference in people’s lives through harnessing and analyzing surrounding data sourced via simple cameras, giving us up-to-date maps faster and more efficiently.”

Ekaterina Almasque, Managing Director at Samsung Catalyst Fund

1. Who are you, Jan Erik Solem, and what influenced you the most to become active in the field you are in today?

I normally describe myself as a computer vision person. I’ve been in computer vision in different forms since the late 1990s. As I went from being an undergrad to assistant professor, all my academic work revolved around recognizing objects in images. I then went on to start my first computer vision company, Polar Rose, while I was still doing my PhD, as I saw a technology gap in the facial recognition field that I wanted to fill. I’ve always been applications-driven in that I want to build technology that solves real problems. That’s what we’re doing at Mapillary today as well: we use computer vision to extract map data at scale, to help create better and more detailed maps.

2. You’re best known for programming computer vision long before deep learning had its breakthrough in 2012. From an application perspective, thinking of computer-vision-enabled glasses or headsets for instance, how has computer vision developed since then, and what can we expect in the coming 3–5 years that will allow adoption across a broad set of applications?

We’ve seen much wider applications of computer vision over the past few years. Computer vision is now playing a significant role in everything from fashion show fittings to drone delivery, augmented reality, production lines, and autonomous driving. The performance of computer vision algorithms is now orders of magnitude better than before 2012 across a wide range of problems. One positive thing is that there is now a default toolset that works for many problems: deep neural networks. A negative is that training models for these networks is a resource-hungry exercise, and training a state-of-the-art algorithm today costs a lot of money. We’re starting to see some interesting shifts here. Mapillary just announced In-Place ABN, for instance, a new method that cuts the GPU memory required when training deep neural network models by up to 50%. We’ll see significant developments in this space over the coming years.

3. AI has one of the largest open-source communities out there. Besides the talent developing it further, we need data and computing power, which are scarce or expensive resources. This means the direction of AI development is influenced by those who control these two resources. How do you see these limitations, and what has to happen to change that?

There are several actors across the deep learning space that are addressing these limitations. If we start with data — data is everywhere, and it only becomes scarce when it’s made proprietary. At Mapillary, where we deal with data in the form of images and map features, we’ve made a commitment to always remain open so that anyone can access street-level images and map data. This benefits everyone that works with maps, regardless of whether they’re in the map making, automotive, or GIS fields. For example, Mapillary images are used for training algorithms by the Volkswagen Group and many other companies.

As for computing power, I think we’ll see significant developments over the coming years. New chip technology, dedicated hardware, and other inventions will allow computer vision and other forms of AI to do more while using fewer resources.

4. Mapillary is already the second startup you’ve co-founded that, today, falls under the category ‘AI startup’. So I’d argue you saw this technological trend coming before it caught the media’s or investors’ attention. Looking back, knowing the status quo today, would you have changed your approach to starting up, and what are the lessons learned for those just getting started?

I’ve always set out to solve real problems, but as engineers we sometimes discover problems that aren’t yet mainstream. That was the case with my first company. We were some of the first people to build facial recognition technology that automatically tagged people in photos, but potential clients, all of whom added this feature years later, initially rejected the idea, saying no one would use it. Unless the market is ready, it can be very difficult to get people to use the things you build.

Timing was one of the biggest learnings I brought with me to Mapillary. Another is team-building — we operate a fully distributed team, and it’s allowed us to hire some of the world’s best computer vision people. It’s still an unconventional way of running a team, but it’s the way to go if you want to access the best talent.

5. Human vision is most effective when combined with other senses such as feel/touch and hear/listen. Computer vision mostly acts on its own. How important are these other inputs when it comes to making computer vision identify styles, materials, textures and more from images and then label them in natural language?

While computer vision has traditionally operated mostly on its own, we’re seeing new developments that add on other sensors. Autonomous vehicles, for instance, are fully dependent on computer vision, but also use lidar and radar sensors to get a comprehensive understanding of the scene. With the convergence across many fields towards algorithms based on deep neural networks, new possibilities to seamlessly include multiple sensor modalities have opened up. I think we’re going to see more of this in research and applications over the next few years.

6. For years now you have focused on labelling the world in images, that is, recognizing naturally occurring objects in outdoor scenes. What are the biggest discrepancies between computer and human performance when it comes to recognizing objects and scenes, interpreting them, and taking action?

The discrepancies between computer and human performance go both ways. At the level of low-level understanding, computer vision outperforms humans. In many benchmarks, computers outperform humans on isolated tasks like object recognition. They are also much faster and can analyze an image at all scales to find small details and text. Sign recognition to create map data and text recognition to decode and understand signs are two examples where computers do great.

When it comes to action and intent recognition in scenes, however, computers are still weaker than humans. We’re still a few years away from a reality where computers can comprehensively recognise activities in outdoor scenes, and independently take appropriate action.

Thanks, Jan Erik, for your time & insight!
