Using Semi-Supervised Learning to Label Large or Complex Datasets

Alexa Tipton
The Emburse Tech Blog
5 min read · Oct 19, 2022

Machine learning and artificial intelligence are often broadly defined as the building of processes that replicate human cognition. This is frequently romanticized further as a path toward systems with human understanding, consciousness and sentience. Achieving that future, however, requires a significant investment of human effort to clean, tag, annotate and label data, enriching it to the level of quality needed to produce useful machine learning models. Depending on the model, millions of labeled observations may be required. Leveraging machine learning itself to accelerate the labeling of training data, a semi-supervised approach, has proven successful for our team at Emburse.

What are Supervised and Unsupervised Learning?

Supervised learning is a machine learning technique that uses labeled data to train a model. For classification or regression models, large amounts of labeled data are necessary to develop a robust model. Labeled data teaches the model what sorts of features and patterns to expect. A larger dataset is not only more likely to include rare, fringe cases, it also reinforces what the model learns, helping it “study” through repetition, so to speak. How much data is necessary to train properly? It can be difficult to say, but the general philosophy is that more is more: hundreds of data points are fine, but hundreds of thousands are much better. Supervised learning relies on as much high-quality, labeled data as possible to produce the most accurate model. As we say in data science: garbage in, garbage out.
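As a toy illustration of the idea (this sketch uses scikit-learn and a built-in dataset, not the expense data discussed later), supervised learning boils down to fitting a model on features paired with human-provided labels:

```python
# Minimal supervised learning sketch: fit a classifier on labeled examples, then
# measure accuracy on held-out data. Toy dataset only; not the model used in this post.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # features plus human-provided labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)          # the model learns patterns from the labeled rows
print("Held-out accuracy:", clf.score(X_test, y_test))
```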

Unsupervised learning might seem preferable in the face of manually labeling thousands of data points, but it solves problems of a different nature, generally through pattern-seeking and clustering algorithms. That is to say, if you have a classification problem, there is simply no way around labeling data, and a lot of it.
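To make that contrast concrete, here is a small, hedged sketch of what unsupervised learning actually gives you: k-means will happily group similar points, but the cluster IDs it returns are arbitrary and still need a human to map them to real categories.

```python
# Unsupervised learning finds structure without labels. k-means groups points into
# clusters, but cluster IDs are arbitrary; they are not the business labels you need.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])  # two synthetic blobs

cluster_ids = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(cluster_ids[:10])  # e.g. [1 1 1 0 1 ...]: groupings, not meaningful labels
```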

Manually Labeling Data

Labeling data can be one of the biggest hurdles to creating a machine learning model. It can be tedious, time consuming, or even borderline impossible, depending on the size and scope of the dataset. There are various ways of labeling data, but most involve some degree of manual work. That could be as simple as going row by row in Excel and typing a label, or building a web interface for quickly pointing and clicking on labels. For a few thousand rows of data this is a tedious task. For hundreds of thousands, or even millions, of rows, it is effectively impossible.
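For reference, the “simple” end of manual labeling really can be this small; a minimal command-line version might look like the sketch below (the file names and label choices are hypothetical placeholders):

```python
# Bare-bones manual labeling loop: show each row, record a typed label, save the result.
# File names and the label choices are hypothetical placeholders.
import pandas as pd

unlabeled = pd.read_csv("unlabeled.csv")  # hypothetical input file
labels = []
for _, row in unlabeled.iterrows():
    print(row.to_dict())
    labels.append(input("Label (a/b/c): ").strip())

unlabeled["label"] = labels
unlabeled.to_csv("labeled.csv", index=False)
```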

More data leads to more accurate models, so it makes sense to want to maximize the training set, but the daunting task of labeling may tempt you to shrink it instead. Even for just a few thousand rows, complex data can be very slow to label: at 20 seconds per data point, 3,000 data points take nearly 17 hours to label.

Semi-Supervised Data Labeling

Unless you are able to use a pre-trained model to label your data (depending on your needs — and budget — you could have luck with something like GPT-3), any method is going to involve some degree of brute force. The idea is to minimize this by creating a model to label for you:

1. Label a minimal amount of data.
2. Train a model on it.
3. Use the model to label more data.
4. Check the new labels for accuracy and correct mistakes.
5. Retrain the model with the expanded dataset, and repeat.
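In code, that loop might be sketched roughly as follows. This is my own illustration, not the implementation used here: it assumes text data in hypothetical CSV files with text and label columns, uses a simple scikit-learn pipeline as a stand-in model, and leaves the review step as a comment because that part is, by definition, manual.

```python
# A sketch of the labeling loop above, assuming text data and scikit-learn.
# "seed_labels.csv" and "unlabeled_pool.csv" are hypothetical files with
# "text" and (for the seed set) "label" columns.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

labeled = pd.read_csv("seed_labels.csv")   # small, manually labeled seed set
pool = pd.read_csv("unlabeled_pool.csv")   # large unlabeled pool

for round_num in range(5):                 # this post repeats the cycle five times
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(labeled["text"], labeled["label"])

    batch = pool.sample(n=300, random_state=round_num).copy()
    batch["label"] = model.predict(batch["text"])

    # ... manually review `batch` here and correct any mislabeled rows ...

    labeled = pd.concat([labeled, batch], ignore_index=True)
    pool = pool.drop(batch.index)
```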

Getting Started

For my purposes, I had a dataset that was not only very large, but also very complex. Each data point took a human around 30 seconds to read through and make a decision. Because my data was very complex, I decided to use a deep learning model. However, this method can, in theory, be applied to any model and any level of complexity in your data.
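The actual network isn’t shown in this post, so the snippet below is purely a hypothetical illustration of what a small deep learning classifier might look like in Keras, assuming each row has already been converted into a fixed-width numeric feature vector; the layer sizes and class count are placeholders.

```python
# Hypothetical illustration only; the actual architecture is not described in the post.
# A small Keras classifier over pre-vectorized features with NUM_CLASSES possible labels.
import tensorflow as tf

NUM_FEATURES = 128  # placeholder: width of your feature vectors
NUM_CLASSES = 3     # placeholder: number of labels in your problem

model = tf.keras.Sequential([
    tf.keras.Input(shape=(NUM_FEATURES,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```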

The first step is to manually label a small amount of data. Because I had a fairly balanced dataset, I chose a few hundred rows essentially at random. If your data is very unbalanced, you may want to be more careful with your selection.
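A sketch of that selection step, with hypothetical file and column names: for a roughly balanced dataset a plain random sample works, while for an imbalanced one, sampling within each known or suspected group keeps rare cases from being missed.

```python
# Choosing the initial rows to hand-label. File and column names are hypothetical.
import pandas as pd

df = pd.read_csv("all_data.csv")

# Roughly balanced data: a plain random sample of a few hundred rows is fine.
seed = df.sample(n=300, random_state=42)

# Very unbalanced data: sample within each group so rare cases are represented.
# seed = df.groupby("suspected_group", group_keys=False).apply(
#     lambda g: g.sample(min(len(g), 50), random_state=42))

seed.to_csv("to_label.csv", index=False)  # hand these off for manual labeling
```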

These are the results from the first round of training:

+---------------+-------------------+
| Test Score    | 0.613903284072876 |
| Test Accuracy | 0.75              |
+---------------+-------------------+

Not bad, though the accuracy figure should be taken with a grain of salt given the very small sample size. Data quality and model selection also matter a great deal: rich, high-quality data applied to the right model will yield the best results.
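For context on where numbers like these come from: they are typically the output of evaluating on a held-out slice of the labeled data. Continuing the hypothetical Keras sketch from earlier (with stand-in random data so the snippet runs on its own), evaluate() returns the test loss, often reported as the “score”, alongside the accuracy.

```python
# Where figures like "test score" and "test accuracy" typically come from: hold out
# part of the labeled data and evaluate on it. With Keras, evaluate() returns
# [loss, accuracy]; the loss is what is often reported as the "score".
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in random data so this runs on its own; in practice X is your feature
# matrix and y your manually assigned labels. `model`, NUM_FEATURES and
# NUM_CLASSES come from the hypothetical Keras sketch above.
X = np.random.rand(300, NUM_FEATURES).astype("float32")
y = np.random.randint(0, NUM_CLASSES, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.1, verbose=0)

score, accuracy = model.evaluate(X_test, y_test, verbose=0)
print(f"Test score (loss): {score:.3f}   Test accuracy: {accuracy:.3f}")
```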

The Grind

Now that you’ve labeled some data and (hopefully) gotten some promising initial results, it’s time to apply your model to fresh data. For my second set, I chose a few hundred more rows, also at random. After predicting on the new data, I manually reviewed the results, correcting the rows the model had mislabeled. On this data, the model performed at close to 83% accuracy, a pleasant surprise.
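A sketch of that step, assuming a model with a scikit-learn-style predict() such as the pipeline in the loop sketch earlier; file and column names are hypothetical:

```python
# Pre-label a fresh batch with the current model, export it for human review, then
# measure how often the model was right. File and column names are hypothetical;
# `model` is the pipeline trained in the loop sketch above.
import pandas as pd

batch = pd.read_csv("next_batch.csv")                    # a few hundred fresh, unlabeled rows
batch["predicted_label"] = model.predict(batch["text"])  # model proposes labels
batch.to_csv("to_review.csv", index=False)               # review by hand, fix mistakes, save

reviewed = pd.read_csv("reviewed.csv")                   # same rows with corrected labels added
accuracy = (reviewed["predicted_label"] == reviewed["corrected_label"]).mean()
print(f"Accuracy on this batch: {accuracy:.0%}")
```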

Newly labeled data was joined with the first set, and the model was retrained with the following results:

+---------------+--------------------+
| Test Score    | 0.6114386320114136 |
| Test Accuracy | 0.7941176295280457 |
+---------------+--------------------+

Now you simply repeat this process until your model has an accuracy you are comfortable with, using datasets as large or small as you have time to manually review.

Results

For my purposes, I knew I was finished once accuracy reached the 90s. When tested, the model came in at about 93% accurate, similar to the accuracy achieved using human labelers. In the end, I had repeated the process five times on about 6,000 rows of data. I was then able to label roughly 250,000 more rows, which I used to great success in building another model.
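The bulk labeling step at the end is the easy part: once the model is trusted, it labels everything left over in one pass. A hedged sketch with hypothetical file and column names:

```python
# Final pass: once accuracy is acceptable, label the remaining data in bulk.
# File and column names are hypothetical; `model` is the last retrained model.
import pandas as pd

remaining = pd.read_csv("remaining_unlabeled.csv")     # e.g. the ~250,000 still-unlabeled rows
remaining["label"] = model.predict(remaining["text"])
remaining.to_csv("machine_labeled.csv", index=False)   # becomes training data for the next model
```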

Final Thoughts

The mission of Emburse is to humanize work through automation and prescriptive insights that optimize enterprise spend and expense management. Machine learning is an integral part of our strategy to create products that deliver on this mission and delight our customers. As a data steward serving nearly 10,000 companies worldwide, Emburse has an extremely large and rich repository of corporate spend data to fuel machine learning research and development. Leveraging those data assets, however, requires a large investment to tag, annotate and label that data in order to produce useful models. The approach described here lets us make that investment more efficiently and, as a result, achieve more.
