The PJC Synthetic Data Thesis — Why We Invested In Synthesis.AI

Synthetic data is a difficult concept to grasp on first exposure. In short, it means creating artificial data to feed into a machine learning model so it performs better, because the model doesn't have enough real data on its own. Most people, upon hearing this, think it sounds like we are making things up for our machine learning models. In this post, I will put synthetic data in layman's terms, explain some use cases, highlight the PJC thesis, and discuss our investment in Synthesis.AI.

What Is Synthetic Data?

The first thing to know about synthetic data is that there are use cases where it works and use cases where it doesn't. Most of the early synthetic data players were focused on self-driving cars. A car can only drive around the real world so fast, taking in data, and it may miss scenarios that we can imagine. To get around that, what if we built a virtual world and let the car navigate that instead? Yes, it lacks some features of the real world, but it is still valid input to the machine learning model that drives the car.

This can be extended to other ideas, but not every idea. To generate good synthetic data, you need two things: a sample of real data, and a sense of whether good synthetic data can actually be generated from that real data. You have to understand the space of what is possible.

Examples of Synthetic Data

As another example, say I want to build a robot that picks aluminum cans out of the trash. Cans can crumple in thousands of different ways. If I have 10 pictures of crumpled Coca-Cola cans, that may not be enough to train a machine learning model to identify cans at a high enough rate. I need more data. One option is to go crumple 5,000 more cans, in all different ways, in all kinds of weather, lying next to all kinds of different things in a landfill. That's expensive. The easier option is to use a physics engine to create 5,000 fake crumpled-can images, based on the real crumpled cans, and add them to my data set.
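The can example can be sketched in miniature. A real physics engine would render crumples from 3D geometry; the toy below, assuming only NumPy, stands in for that by applying random flips, rotations, and noise to one real image to mint many synthetic variants. The function and its parameters are illustrative, not any product's actual API.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def synthesize_variants(image, n=5):
    """Generate n synthetic variants of one real image
    (a toy stand-in for a physics-engine renderer)."""
    variants = []
    for _ in range(n):
        img = image.copy()
        # random horizontal / vertical flips
        if rng.random() < 0.5:
            img = np.fliplr(img)
        if rng.random() < 0.5:
            img = np.flipud(img)
        # random 90-degree rotation
        img = np.rot90(img, k=int(rng.integers(0, 4)))
        # additive noise to mimic varied lighting and sensor conditions
        img = np.clip(img + rng.normal(0, 10, img.shape), 0, 255)
        variants.append(img.astype(np.uint8))
    return variants

real_can = rng.integers(0, 256, (64, 64, 3), dtype=np.uint8)  # stand-in photo
synthetic = synthesize_variants(real_can, n=5000)
print(len(synthetic))  # 5000 synthetic training images from one real one
```

In practice the transformations would be far richer (crumple geometry, lighting, background clutter), but the economics are the same: one real sample fans out into thousands of labeled training examples.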

Synthetic data works well when tweaking a real image is easy. Suppose you unlock your phone with your face, and the underlying machine learning model performs poorly in unusual lighting, or when you wear glasses or hats or grow facial hair. From one real picture of you, I can quickly create very realistic pictures of you in glasses, you in a hat, you with a different haircut, and so on.
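As a tiny illustration of that kind of tweak (lighting only; glasses and hats would need a generative model), here is a sketch using NumPy. `relight` is a hypothetical helper I made up for this post, not a real library call.

```python
import numpy as np

def relight(image, gains=(0.5, 0.75, 1.25, 1.5)):
    """Synthesize darker and brighter copies of one photo by
    scaling pixel intensities (a crude stand-in for re-lighting)."""
    return [np.clip(image.astype(float) * g, 0, 255).astype(np.uint8)
            for g in gains]

photo = np.full((4, 4, 3), 100, dtype=np.uint8)  # stand-in enrollment photo
variants = relight(photo)
print([int(v[0, 0, 0]) for v in variants])  # [50, 75, 125, 150]
```

One enrollment photo becomes several lighting conditions for free, which is exactly the cheap-to-tweak property that makes a use case a good fit for synthetic data.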

For some areas, though, synthetic data works less well. Given 100 sentences of English, I may not be able to accurately generate 10,000 synthetic English sentences to feed into an NLP model. The sentences may be very basic and not helpful to the model. Knowing when you can use synthetic data and when you can't is important.

If you want a more technical example, here is a blog post from the Google AI team about creating a synthetic data set.

The PJC Thesis

Synthetic data is a very, very early space. So if you are reading this more than six months after it was published, our thesis has likely changed based on what we have learned and what has happened in the market. At the moment, our thesis is this:

  1. Small data AI techniques are still a long way off, and so data remains a major blocker to training machine learning models for certain tasks.
  2. Advances in data generation, particularly with GANs, are making synthetic data easier to produce than ever before.
  3. More and more applications are going to have ML models built in, and synthetic data is going to be the most cost effective way to build data sets for many of those models.
  4. The performance of those models will vary over time as use cases expand, underlying data sets drift, and other factors affect the models. So the models will need to be continually updated with new data.
  5. The current software workflow of identifying a bug, understanding the cause, fixing it, reviewing the code fix, and publishing it to the code base is similar to the workflow that will develop for models. A model will underperform in certain production cases, a “bug” will be filed, a data scientist will figure out what data may improve the model, that data will be generated, the new model will be tested, and then the new model will be published.
  6. In most cases, the generation of synthetic data will benefit from scale, so it will make more sense to hire a platform to do it rather than build internal synthetic data tools.
  7. As such, the winner(s) in the synthetic data space will be those that sit at the intersection of a) best use cases and b) best integration to existing workflows.
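Point 5 above can be sketched as a loop. Everything here is hypothetical scaffolding — the function names, the 0.95 threshold, and the toy stubs are mine, not any product's API — but it shows the shape of the “bug fix” workflow for models:

```python
def maintenance_cycle(model, train, evaluate, generate_synthetic,
                      production_cases, threshold=0.95):
    """One pass of the 'bug fix' workflow for an ML model."""
    failures = [case for case in production_cases
                if evaluate(model, case) < threshold]   # file the "bug"
    if not failures:
        return model                                    # nothing to fix
    new_data = generate_synthetic(failures)             # targeted synthetic data
    candidate = train(model, new_data)                  # retrain on the fix
    if all(evaluate(candidate, c) >= threshold for c in failures):
        return candidate                                # publish the new model
    return model                                        # keep the old model

# Toy stubs to exercise the loop: "training" just records which cases
# the model has been patched to handle.
base = {"fixed": set()}
def evaluate(m, case): return 1.0 if case in m["fixed"] else 0.5
def generate_synthetic(cases): return list(cases)
def train(m, data): return {"fixed": m["fixed"] | set(data)}

improved = maintenance_cycle(base, train, evaluate, generate_synthetic,
                             production_cases=["glare", "rain"])
print(evaluate(improved, "glare"))  # 1.0
```

The interesting question for the thesis is who supplies the `generate_synthetic` step: if it benefits from scale, it becomes a platform rather than an internal tool.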

Much of this could turn out not to be true. Synthetic data could be too difficult to generate for most use cases. It could be something people use to bootstrap minimum viable models but never need for fixing models later. It could be that small data AI arrives fast enough that we don't need more data. But we work in a world of high uncertainty, and these are risks we are comfortable with.

Our Investment

When we looked at the synthetic data market and decided to place a bet, the vast majority of the companies we saw were just ideas or prototypes; very few teams are working on independent companies yet. Synthesis.AI already had published research on synthetic data use cases and already had paying customers. The customers we spoke to loved the product, all said they planned to use more synthetic data in their businesses over time, and those in pilots expected to move to an enterprise license for Synthesis down the road.

The CEO, Yashar Behzadi, is a PhD with a strong background for this work and previous startup experience. Head of Product Matthew Moore is an expert in synthetic data for robotics and an entrepreneur I backed when I was angel investing. They have surrounded themselves with a great group of advisors, investors, and early customers.

Their approach to synthetic data workflows, and the tool integrations those workflows will require, matched our own thinking, so the investment seemed like a great fit. We are excited to announce today that we've invested in Synthesis.AI. (Note: this is my second investment at PJC.)

Going Forward With Synthetic Data

If you are an executive or technologist and find yourself lacking the data you need for high-performing machine learning models, feel free to reach out if you would like an intro to the Synthesis team. They already have an impressive customer list.

If you are an entrepreneur working in the synthetic data space and not competitive with Synthesis.AI, please reach out and tell us about your business. We anticipate making more investments in this general space.




The PJC Ventures Team Blog

Rob May

CTO/Founder at Dianthus, author of a Machine Intelligence newsletter, former CEO at Talla and Backupify.
