Synthetic Data in E-commerce

Gregory Belhumeur
Published in SSENSE-TECH
6 min read · Aug 12, 2022
Photo by Donald Giannatti

Synthetic data is information that has been produced artificially and can be used in place of actual historical data. When genuine data sets are lacking in quality, volume, or diversity, synthetic data can help fill the gap.

Since data is the new oil, understanding how to build your own well will soon be crucial for both tech leaders and data subject matter experts.

In this article, I will explore what synthetic data is, how to create it, and different use-cases where it can be applied.

What is Synthetic Data?

Synthetic data, as the name implies, is data that is generated artificially rather than from natural events. In most cases, it is generated using algorithms and is used for a variety of purposes. If you come from an AI/ML background, you can see synthetic data as a type of data augmentation.

Because synthetic data is still fairly new, its application across industries and types of data is still uneven. Image generation and business data generation are the two most popular use-cases in today’s world.

Image Generation is the most common application of synthetic data. Deepfakes, videos in which artificial intelligence replaces people’s faces, are a good example.

Leaders in AI and computing such as NVIDIA are spearheading research in image generation, developing tools and frameworks that generate photorealistic images. Other companies, such as FaceApp and wombo.ai, are democratizing this technology by making it easier than ever to make anyone look like a superstar, sing like Elon Musk, or create pieces of art on demand.

Business Data Generation is steadily becoming a widespread practice as tech companies use synthetic business data for AI training, analytics, product development, and testing.

Synthetic business data is used in industries where consumer data is highly sensitive and regulated such as banking, insurance, health care, and telecommunications.

Creating Synthetic Data

There are multiple ways to generate synthetic data. This article will focus on the most common one, the Generative Adversarial Network (GAN).

A GAN consists of two agents: a generator that learns to create plausible data, and a discriminator that learns to distinguish real data from the fake data created by the generator.

Both the generator and the discriminator are neural networks. The generator’s output is connected directly to the discriminator input. The discriminator’s classification provides a signal that the generator uses to adjust itself.

Simply put, the generated samples are used to penalize the generator when flagged as fake by the discriminator. Those same generated samples are used as negative training examples for the discriminator.

When training begins, the generator creates phony data. Over time, the generator gets better at providing output that can trick the discriminator while the discriminator gets better at classifying fake samples.

Here’s a summary of the entire system:

Figure 1. Basic Architecture of a GAN.
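To make the training loop described above more concrete, here is a minimal sketch in plain Python. It rests on heavy simplifying assumptions that are mine, not part of the original architecture: the data is one-dimensional, the generator is a linear map G(z) = a*z + b, the discriminator is a logistic regression D(x) = sigmoid(w*x + c), and the gradients are written by hand instead of relying on a deep learning framework. The names TinyGAN and train are hypothetical.

```python
import math
import random


def sigmoid(t):
    """Logistic function, written to avoid overflow for large |t|."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def softplus(t):
    """log(1 + exp(t)), computed stably. Note -log(sigmoid(t)) == softplus(-t)."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))


class TinyGAN:
    """Toy GAN (assumption): G(z) = a*z + b and D(x) = sigmoid(w*x + c)."""

    def __init__(self, lr=0.01):
        self.a, self.b = 1.0, 0.0  # generator parameters
        self.w, self.c = 0.1, 0.0  # discriminator parameters
        self.lr = lr

    def generate(self, z):
        return [self.a * zi + self.b for zi in z]

    def logit(self, x):
        return self.w * x + self.c

    def discriminate(self, xs):
        """Probability, according to the discriminator, that each sample is real."""
        return [sigmoid(self.logit(x)) for x in xs]

    def d_loss(self, real, fake):
        # -log D(real) - log(1 - D(fake)), averaged over each batch
        loss_real = sum(softplus(-self.logit(x)) for x in real) / len(real)
        loss_fake = sum(softplus(self.logit(x)) for x in fake) / len(fake)
        return loss_real + loss_fake

    def g_loss(self, z):
        # The generator is penalized when its samples are flagged as fake.
        fake = self.generate(z)
        return sum(softplus(-self.logit(x)) for x in fake) / len(fake)

    def discriminator_step(self, real, fake):
        gw = gc = 0.0
        for x in real:  # real samples are positive examples
            p = sigmoid(self.logit(x))
            gw -= (1.0 - p) * x / len(real)
            gc -= (1.0 - p) / len(real)
        for x in fake:  # generated samples are negative examples
            p = sigmoid(self.logit(x))
            gw += p * x / len(fake)
            gc += p / len(fake)
        self.w -= self.lr * gw
        self.c -= self.lr * gc

    def generator_step(self, z):
        ga = gb = 0.0
        for zi in z:
            p = sigmoid(self.logit(self.a * zi + self.b))
            dx = -(1.0 - p) * self.w  # gradient of -log D(x) with respect to x
            ga += dx * zi / len(z)
            gb += dx / len(z)
        self.a -= self.lr * ga
        self.b -= self.lr * gb


def train(gan, steps=2000, batch=64, real_mean=4.0, real_std=1.0, seed=0):
    """Alternate one discriminator update and one generator update per step."""
    rng = random.Random(seed)
    for _ in range(steps):
        real = [rng.gauss(real_mean, real_std) for _ in range(batch)]
        z = [rng.gauss(0.0, 1.0) for _ in range(batch)]
        gan.discriminator_step(real, gan.generate(z))
        gan.generator_step(z)
    return gan


gan = train(TinyGAN(), steps=500)
print("generator offset b:", round(gan.b, 2))
```

Because this discriminator is linear, the toy generator can at best be expected to move its samples toward the mean of the real data. In practice both agents are deep neural networks, and training is considerably more delicate.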

There are variations of the “basic” GAN presented above. Alternate architectures such as InfoGAN[1], Conditional GAN[2] and Auxiliary-Classifier GAN[3] let you feed additional structured inputs to the generator so that you can prescribe what kind of sample it produces.

Figure 2. “Basic” Architecture of a GAN compared to various alternatives that enable the control of input variables.
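As a rough illustration of how a condition can be introduced, the sketch below follows the idea used in a Conditional GAN[2]: the condition is encoded as a one-hot vector and appended both to the generator's noise input and to whatever the discriminator inspects. Only the input construction is shown, not the networks themselves. The category labels and the function names are hypothetical and chosen purely for the example.

```python
import random

CATEGORIES = ["shoes", "bags", "outerwear"]  # hypothetical condition labels


def one_hot(label, classes):
    """Encode a label as a vector with a single 1.0 at the label's position."""
    return [1.0 if c == label else 0.0 for c in classes]


def generator_input(label, noise_dim, rng, classes=CATEGORIES):
    """Random noise z concatenated with the one-hot condition y."""
    z = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    return z + one_hot(label, classes)


def discriminator_input(sample, label, classes=CATEGORIES):
    """The discriminator judges the sample together with the same condition."""
    return list(sample) + one_hot(label, classes)


rng = random.Random(0)
g_in = generator_input("bags", noise_dim=4, rng=rng)
print(len(g_in), g_in[-3:])
```

At generation time, changing the label while keeping the trained generator fixed is what makes it possible to ask for a specific kind of sample.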

Use Cases in e-commerce

Assisting AI Training

Synthetic data can be used to train AI models for scenarios in which limited data is available.

AI and machine learning models that rely on unbalanced data sets, such as fraudulent sales or converting visits in a luxury setting, can be trained much more accurately with the help of synthetic data. The more examples of the rare cases you have, the more detail the model can learn.

Figure 3. GAN used to generate data to train an estimator.
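The sketch below shows one way the rare class of an unbalanced data set could be topped up with synthetic rows before training an estimator. It assumes a binary label, and it uses a deliberately simple stand-in for the generator (independent Gaussians fitted to the minority class) where a trained GAN generator would normally go. The function names and the sample orders are hypothetical.

```python
import random
import statistics


def fit_stand_in_generator(samples, rng):
    """Placeholder for a trained GAN generator (assumption): draws each
    feature independently from a Gaussian fitted to the given samples."""
    columns = list(zip(*samples))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]

    def generate():
        return tuple(rng.gauss(m, s) for m, s in zip(means, stds))

    return generate


def balance_with_synthetic(features, labels, minority_label, rng):
    """Append synthetic minority rows until both classes have the same size."""
    minority = [x for x, y in zip(features, labels) if y == minority_label]
    n_majority = len(labels) - len(minority)
    n_missing = max(0, n_majority - len(minority))
    generate = fit_stand_in_generator(minority, rng)
    synthetic = [generate() for _ in range(n_missing)]
    return features + synthetic, labels + [minority_label] * n_missing


rng = random.Random(42)
# Hypothetical orders: (basket value, items in basket); label 1 marks fraud.
features = [(120.0, 1), (80.0, 2), (95.0, 1), (110.0, 3), (60.0, 1),
            (900.0, 6), (750.0, 5)]
labels = [0, 0, 0, 0, 0, 1, 1]
X, y = balance_with_synthetic(features, labels, minority_label=1, rng=rng)
print(y.count(0), y.count(1))
```

The original rows are left untouched and only the minority class grows, so the estimator sees the rare cases as often as the common ones.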

When past data is no longer a reliable pointer to the future, synthetic data can help prevent a dramatic decline in AI model accuracy.

Most AI models are trained on past data, and many will fail to deliver if the ecosystem they learn from changes too quickly. The COVID-19 pandemic is a prime example of such a change: pre-pandemic experiences and data points are no longer valid for predicting the near future. On a philosophical level, synthetic data frees AI from the constraint of learning only from historical data.

Figure 4. GAN used to train an estimator with synthetic data based on COVID-19 Assumptions.

In fact, Gartner predicts that by 2024, the use of synthetic data will halve the volume of real data needed for ML — accelerating data-driven innovation.[4]

Additionally, according to Gartner analyst Svetlana Sicular, by 2024, 60% of the data used for the development of AI and analytics solutions will be synthetically generated.

Figure 5. Data used for AI over time (Gartner).

Enabling Faster, Privacy-Compliant Data Sharing

Synthetic data not only facilitates data sharing within enterprises, but also allows data sharing outside of the organization.

With GDPR, CCPA, and other privacy legislation around the world making the exchange of personal information more difficult, if not impossible, synthetic data is critical to facilitating collaboration.

It also enables engagement with external technology partners and leading labs around the world to create state-of-the-art solutions.

For an e-commerce company, synthetic data makes it possible to exchange data with vendors. This type of collaboration can translate directly into more sales. Cooperative promotions and advertising are good examples. Both brands and retailers can contribute synthetic customer and inventory data to ensure promotions are optimized and carried out without a hitch.

Tech-heavy companies can also create new revenue streams by selling generated data as second-party data, an avenue that is becoming more popular as the industry moves toward a cookieless future.

According to Gartner, firms that share data externally with their partners earn three times more measurable economic benefit than those that do not.[5] Synthetic data is a strong tool for securing this edge.

Figure 6. GAN used to share synthetic data with external partners.

Synthetic Data as a Simulation Tool

Certain methods to generate synthetic data, like InfoGAN, Conditional GAN and Auxiliary-Classifier GAN, will let you introduce information in the generation process to alter the generated data.

Altered data may sound like bad news, but by strategically modeling their business and their ecosystems, companies can develop synthetic business scenarios.

Figure 7. GAN used to simulate a sale scenario based on potential inventory data.
Figure 8. GAN used to simulate a sale scenario based on potential inventory data.

Say you built a GAN that generates daily sales data. By embedding inventory data in the sales data structure, you could run simulations to evaluate sales given different inventory assortments. The same could be done for your marketing budget, your product pricing, and much more.

This technique, paired with optimization and computing power, would allow you to search for the most profitable scenarios in a massive universe of possibilities.

Figure 9. GAN used in inventory optimization based on simulated sales scenarios.
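The search over scenarios can be sketched as follows. This is a toy under stated assumptions: simulate_daily_sales is a hypothetical stand-in for a conditional generator that takes inventory as its condition, the prices, unit costs and demand figures are invented for illustration, and the "optimization" is a plain comparison of a handful of candidate assortments.

```python
import random
import statistics

PRICE = {"sneakers": 300.0, "coats": 900.0}      # hypothetical unit prices
UNIT_COST = {"sneakers": 120.0, "coats": 400.0}  # hypothetical unit costs


def simulate_daily_sales(inventory, rng):
    """Stand-in for a conditional generator (assumption): returns units sold
    per product for one synthetic day, never exceeding the available stock."""
    demand = {"sneakers": rng.gauss(20, 5), "coats": rng.gauss(8, 3)}
    return {p: max(0, min(stock, round(demand[p])))
            for p, stock in inventory.items()}


def expected_profit(inventory, n_days, rng):
    """Average profit over n_days simulated days for one inventory assortment."""
    cost = sum(UNIT_COST[p] * units for p, units in inventory.items())
    profits = []
    for _ in range(n_days):
        sold = simulate_daily_sales(inventory, rng)
        revenue = sum(PRICE[p] * units for p, units in sold.items())
        profits.append(revenue - cost)
    return statistics.fmean(profits)


def best_assortment(candidates, n_days=200, seed=0):
    """Score every candidate on the same random draws and keep the best one."""
    scored = []
    for inventory in candidates:
        rng = random.Random(seed)  # identical demand scenarios per candidate
        scored.append((expected_profit(inventory, n_days, rng), inventory))
    return max(scored, key=lambda pair: pair[0])


candidates = [
    {"sneakers": 10, "coats": 5},
    {"sneakers": 20, "coats": 8},
    {"sneakers": 40, "coats": 20},
]
profit, inventory = best_assortment(candidates)
print(inventory, round(profit, 2))
```

Reusing the same seed for every candidate means each assortment is judged against the same simulated days, so differences in the score come from the inventory choice rather than from random noise.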

Implementing this technique is a little more complex than I make it sound. But if you manage to develop such a tool, it can supercharge key growth activities like market and category expansion.

Conclusion

As tech leadership and data SMEs grow their understanding of synthetic data, there will be new demand beyond the current main use-cases of assisting AI training, privacy compliance, and business simulation.

Synthetic data has the ability to support digital transformation, improve efficiency, and speed innovation. As this becomes more widely recognized, more industries will see the business benefit of utilizing it.

If you are left wanting more, I’ll walk you through the implementation of a GAN in a follow-up article. In the meantime, like and follow!

References

  1. https://proceedings.neurips.cc/paper/2016/file/7c9d0b1f96aebd7b5eca8c3edaa19ebb-Paper.pdf
  2. https://arxiv.org/pdf/1411.1784.pdf
  3. http://proceedings.mlr.press/v70/odena17a/odena17a.pdf
  4. https://blogs.gartner.com/andrew_white/2021/07/24/by-2024-60-of-the-data-used-for-the-development-of-ai-and-analytics-projects-will-be-synthetically-generated/
  5. https://www.gartner.com/smarterwithgartner/data-sharing-is-a-business-necessity-to-accelerate-digital-business

Editorial reviews by Catherine Heim & Mario Bittencourt

Want to work with us? Click here to see all open positions at SSENSE!

Gregory Belhumeur
SSENSE-TECH

I build AIs, models and algorithms that make our competitors think we're using cheat-codes. Principal, AI/ML @ SSENSE + Partner @ Beaucoup Data