The Magic of Synthetic Data
Using Artificial Intelligence to Train Artificial Intelligence with GPT-2
Recently, while working on an NLP project, I ran into a problem: I didn’t have enough data. Classification models learn from labeled training data to predict which predetermined category a new input belongs to, and they need large datasets to do this well. So I developed a method that uses artificial intelligence to generate relevant synthetic data to improve the performance of my classification models. With this method, my baseline model’s accuracy increased by 9.49% and its precision increased by 7.63%.
Synthetic Data Background
Using synthetic data to boost the performance of machine learning models is becoming common practice. Shell is reported to use synthetic data to build models that detect rarely occurring problems; for example, Shell created synthetic data to help models identify deteriorating oil lines (Higginbotham, 2020). Machine learning practitioners also routinely generate synthetic data by rotating, flipping, and cropping images, which increases the volume of image data available to train convolutional neural networks.
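To make the image case concrete, here is a minimal sketch of rotating, flipping, and cropping, written in plain Python on a toy grid of pixel values. The function names (rotate_90, flip_horizontal, crop, augment) and the toy image are hypothetical, chosen only for illustration; a real pipeline would most likely rely on an image library rather than nested lists.

```python
import random


def rotate_90(image):
    """Rotate a 2D grid of pixels 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def flip_horizontal(image):
    """Mirror every row left to right."""
    return [row[::-1] for row in image]


def crop(image, top, left, size):
    """Cut a size x size window whose top-left corner is (top, left)."""
    return [row[left:left + size] for row in image[top:top + size]]


def augment(image, crop_size, rng):
    """Return three synthetic variants of one image: rotated, flipped, cropped."""
    height, width = len(image), len(image[0])
    if crop_size > min(height, width):
        raise ValueError("crop_size must fit inside the image")
    # Pick a random crop position that keeps the window inside the image.
    top = rng.randint(0, height - crop_size)
    left = rng.randint(0, width - crop_size)
    return [
        rotate_90(image),
        flip_horizontal(image),
        crop(image, top, left, crop_size),
    ]


# Toy 3x3 "image" (assumed data, one integer per pixel).
toy_image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
variants = augment(toy_image, crop_size=2, rng=random.Random(0))
```

Each call turns one labeled image into several training examples that share the original label, which is how this kind of augmentation increases the size of a dataset without collecting new images.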