Augment your Hugging Face model with Human-in-the-Loop

Continuously improve your models by adding human-in-the-loop capabilities with Amazon SageMaker Augmented AI and Pipelines

Georgios Schinas
7 min read · Nov 11, 2022
Photo by charlesdeluvio on Unsplash

Since early 2021, when Hugging Face partnered with Amazon and the Hugging Face libraries became natively supported on AWS's SageMaker platform, it has never been easier to fine-tune and deploy language models in the cloud. With only a few lines of code, data scientists and ML engineers can fine-tune their Hugging Face language models using cloud resources and, when ready, deploy them on fully managed scalable or serverless infrastructure.
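To give a feel for what "a few lines of code" means in practice, here is a minimal sketch of fine-tuning and deploying with the SageMaker HuggingFace estimator; the entry point script, S3 path, framework versions and instance types are illustrative placeholders you would adapt to your own setup.

import sagemaker
from sagemaker.huggingface import HuggingFace

role = sagemaker.get_execution_role()

# Fine-tune a Hugging Face model on SageMaker (train.py is a placeholder training script)
huggingface_estimator = HuggingFace(
    entry_point="train.py",
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    role=role,
    transformers_version="4.17",
    pytorch_version="1.10",
    py_version="py38",
    hyperparameters={"model_name": "distilbert-base-uncased", "epochs": 1},
)
huggingface_estimator.fit({"train": "s3://my-bucket/train"})  # placeholder S3 path

# Deploy the fine-tuned model to a fully managed real-time endpoint
predictor = huggingface_estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
)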

More often than not, such large pre-trained language models achieve impressive performance with only a few task-specific examples. However, once a model is used in a production environment, expectations tend to rise over time and your users or business stakeholders expect fewer and fewer "mistakes" from it.
At that point practitioners are faced with a challenge: "How can we make our models improve over time based on real-world feedback?"

In this post, we explore how to close the loop by adding a human reviewer into the process, so that they can give feedback (annotate real examples) that can immediately be used to further tune and improve model performance. Amazon SageMaker Augmented AI (A2I), a capability of SageMaker, is what will allow us to achieve this objective.
The solution will follow the pattern as per the diagram below.

Schematic of project built in this post | image generated by author

All code displayed in this blog can be found in the GitHub repository.

The use-case

For this example, we consider the relatively simple task of sentiment classification of arbitrary sentences. We will use the distilbert-base-uncased model from the Hugging Face Hub as our base model, which we will later fine-tune with additional examples coming from our human reviewers.
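To make the starting point concrete, this is roughly how the base model could be loaded for a two-class sentiment task with the transformers library (the two-label head is our choice for this use-case and starts untrained):

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load distilbert-base-uncased with a fresh two-class classification head
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2,  # Negative / Positive
)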

The model will be created through a SageMaker Pipelines pipeline and the trained model will be deployed from the Model Registry to a real-time endpoint.

When the model makes a prediction where the sentiment is not very clear (neither clearly positive nor clearly negative), the sentence will be sent for human review.

Once reviews have concluded, or the decision is made to retrain the model for other reasons, the original pipeline is triggered again and produces a fine-tuned model based on the newly available annotations.

The training Pipeline

Our training pipeline consists of three steps: processing, training, and registering the model. The processing step splits the data (we use the imdb reviews dataset for the initial fine-tuning) and also merges our human review annotations into the training dataset. In the training step we fine-tune the model, and finally we register the model in the SageMaker Model Registry.

You can find the code for this step in the notebook 1_create_model_pipeline.ipynb.
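For orientation before you open the notebook, here is a heavily condensed sketch of how the three steps could be wired together with the SageMaker Pipelines SDK; the script names, instance types, framework versions and the model package group name are illustrative and differ from the full notebook.

import sagemaker
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.huggingface import HuggingFace
from sagemaker.workflow.steps import ProcessingStep, TrainingStep
from sagemaker.workflow.step_collections import RegisterModel
from sagemaker.workflow.pipeline import Pipeline

role = sagemaker.get_execution_role()

# Step 1: processing - split the data and merge in the human review annotations
processor = SKLearnProcessor(framework_version="1.0-1", role=role,
                             instance_type="ml.m5.xlarge", instance_count=1)
step_process = ProcessingStep(name="PrepareData", processor=processor,
                              code="preprocess.py")  # placeholder script name

# Step 2: training - fine-tune the Hugging Face model
estimator = HuggingFace(entry_point="train.py", role=role,
                        instance_type="ml.p3.2xlarge", instance_count=1,
                        transformers_version="4.17", pytorch_version="1.10",
                        py_version="py38")
step_train = TrainingStep(name="FineTuneModel", estimator=estimator)

# Step 3: register the trained model in the SageMaker Model Registry
step_register = RegisterModel(
    name="RegisterModel",
    estimator=estimator,
    model_data=step_train.properties.ModelArtifacts.S3ModelArtifacts,
    content_types=["application/json"],
    response_types=["application/json"],
    inference_instances=["ml.m5.xlarge"],
    transform_instances=["ml.m5.xlarge"],
    model_package_group_name="sentiment-models",  # placeholder group name
)

pipeline = Pipeline(name="hf-sentiment-pipeline",
                    steps=[step_process, step_train, step_register])
pipeline.upsert(role_arn=role)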

Once executed, your pipeline will look like the one below:

SageMaker training pipeline | image generated by author

Human review setup

For the human review we leverage Amazon SageMaker Augmented AI (A2I), which extends the capabilities of SageMaker Ground Truth.

The first time you use this, you need to define the labelling workforce. For this example we will create a private workforce (other options are a third-party vendor or Amazon Mechanical Turk). In this private workforce, set yourself as the reviewer, or add the colleagues who will be helping you with the labelling/review process. Once this is set up, you will have access to a portal where any available job assigned to you can be accessed and completed.
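If you prefer code over the console for this step, a rough sketch of creating a private work team with boto3 could look like the one below; the Cognito user pool, user group and app client IDs are placeholders for the ones backing your own private workforce.

import boto3

sm_client = boto3.client("sagemaker")

# Create a private work team backed by an existing Cognito user pool (placeholder IDs)
response = sm_client.create_workteam(
    WorkteamName="sentiment-reviewers",
    Description="Reviewers for low-confidence sentiment predictions",
    MemberDefinitions=[{
        "CognitoMemberDefinition": {
            "UserPool": "eu-west-1_EXAMPLE",     # placeholder user pool id
            "UserGroup": "sentiment-reviewers",  # placeholder user group
            "ClientId": "EXAMPLECLIENTID",       # placeholder app client id
        }
    }],
)
workteam_arn = response["WorkteamArn"]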

Next, and specific to our use-case, you need to define the template that will be used to render the review page. Thankfully, there is a library of pre-made templates available, so you won't be wasting time starting from scratch.

The above steps are described in detail, with step-by-step screenshots and code, in the notebook 2A_a2i_setup.ipynb.
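To give a flavour of what the notebook does, here is a hedged sketch of registering a worker task UI and tying everything together into a flow definition; the template content, names and S3 output path are placeholders, and role, sm_client and workteam_arn carry over from the earlier sketches.

# Register the worker task template (template_html holds the HTML/Liquid template content)
ui_response = sm_client.create_human_task_ui(
    HumanTaskUiName="sentiment-review-ui",
    UiTemplate={"Content": template_html},  # placeholder: template taken from the library
)

# Create a flow definition that connects the work team, the UI and the output location
flow_response = sm_client.create_flow_definition(
    FlowDefinitionName="sentiment-review-flow",
    RoleArn=role,  # execution role with A2I permissions
    HumanLoopConfig={
        "WorkteamArn": workteam_arn,
        "HumanTaskUiArn": ui_response["HumanTaskUiArn"],
        "TaskTitle": "Review sentiment prediction",
        "TaskDescription": "Classify the sentence as Positive or Negative",
        "TaskCount": 1,
    },
    OutputConfig={"S3OutputPath": "s3://my-bucket/a2i-output"},  # placeholder bucket
)
flow_definition_arn = flow_response["FlowDefinitionArn"]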

Model deployment

For the model deployment, we will deploy the model behind a real-time endpoint. Since the model was trained on SageMaker and registered in the Model Registry, the deployment is quick, with only a few lines of code. The notebook 2B_deploy_model.ipynb shows exactly how to do this.
Please note that real-time endpoints are charged for as long as they are running, so once you are done, remember to shut them down (code for this is in the above notebook as well).

In this example, we are deploying on an ml.m5.xlarge instance which, with on-demand pricing, would set you back $0.257 per hour in the Ireland region (eu-west-1) where I am running my tests.
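As a rough sketch of what the notebook does (assuming the latest approved model package ARN has already been looked up in the Model Registry), deploying from the registry could look like this:

import boto3
import sagemaker
from sagemaker.model import ModelPackage

role = sagemaker.get_execution_role()

# Create a deployable model from the registered model package (placeholder ARN)
model = ModelPackage(
    role=role,
    model_package_arn="arn:aws:sagemaker:eu-west-1:123456789012:model-package/sentiment-models/1",
)
model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    endpoint_name="sentiment-endpoint",  # placeholder endpoint name
)

# ...and once you are done testing, delete the endpoint to stop the charges
boto3.client("sagemaker").delete_endpoint(EndpointName="sentiment-endpoint")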

Generate traffic

In order to test our solution, we need to generate some traffic to our endpoint. Follow the instructions in notebook 3_generate_some_traffic.ipynb to do this. But first, a couple of things to point out.

So far we have not connected our endpoint to the human review process in any way. This is exactly what we have to do at this stage.

When deploying your model behind an endpoint, there will always be some other service/application/compute resource consuming the model. Depending on the architecture and pattern you are following, this could be a Lambda function behind an API Gateway, the backend of another business application in your organisation, or something else. In any case, that compute resource will need to invoke the endpoint, get the prediction, apply some arbitrary logic to decide whether human review is required (in this example, based on a confidence level) and optionally kick off a human review.

In our example, the function e2e_prediction() first runs make_prediction() and then check_for_human_review(), which starts a manual review process if the model was not very confident about the detected sentiment.
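The notebook implements these functions in full; the snippet below is a simplified sketch of the same idea, assuming the endpoint returns a label and a score in the usual Hugging Face inference format, and using placeholder endpoint and flow definition names.

import json
import uuid
import boto3

runtime = boto3.client("sagemaker-runtime")
a2i = boto3.client("sagemaker-a2i-runtime")

CONFIDENCE_THRESHOLD = 0.7
ENDPOINT_NAME = "sentiment-endpoint"  # placeholder endpoint name
FLOW_DEFINITION_ARN = "arn:aws:sagemaker:eu-west-1:123456789012:flow-definition/sentiment-review-flow"  # placeholder

def make_prediction(sentence):
    # Invoke the real-time endpoint and parse the returned label and score
    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        Body=json.dumps({"inputs": sentence}),
    )
    result = json.loads(response["Body"].read())[0]
    return result["label"], result["score"]

def check_for_human_review(sentence, label, score):
    # If the model is not confident enough, kick off an A2I human loop
    if score < CONFIDENCE_THRESHOLD:
        human_loop_name = str(uuid.uuid4())
        a2i.start_human_loop(
            HumanLoopName=human_loop_name,
            FlowDefinitionArn=FLOW_DEFINITION_ARN,
            HumanLoopInput={"InputContent": json.dumps(
                {"taskObject": sentence, "initialPrediction": label})},
        )
        print(f"Starting human loop with name: {human_loop_name}")

def e2e_prediction(sentence):
    label, score = make_prediction(sentence)
    check_for_human_review(sentence, label, score)
    return label, score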

Let’s test it out with a sentence that the model will not be very confident with, “I bet I now understand how SageMaker works!” which in human terms we can classify as Positive. Running it through the model triggers a human review process for the sentence.

>>> e2e_prediction("I bet I now understand how SageMaker works!")
Model is thinking that sentence is: Positive but will ask a human to verify
Score is 0.5773128271102905 which is less than the threshold of 0.7
Starting human loop with name: f0a33b75-6734-4229-aa44-ce577d525d07

Once you’ve run the above, the human reviewer portal is immediately updated with the new annotation tasks and you can start reviewing right away!

Human reviewer panel | image generated by author

Retraining model

To retrain your model, you can run the code in notebook 4_retrain_model.ipynb or simply start a new execution of the pipeline from the SageMaker Studio console, as per the image below.

Create new pipeline execution from Studio UI | image generated by author
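If you'd rather trigger the execution programmatically, a minimal sketch with boto3 (the pipeline name is a placeholder) is:

import boto3

# Start a new execution of the existing training pipeline
sm_client = boto3.client("sagemaker")
execution = sm_client.start_pipeline_execution(
    PipelineName="hf-sentiment-pipeline",  # placeholder pipeline name
)
print(execution["PipelineExecutionArn"])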

In this scenario, we are manually triggering the re-run of the pipeline. In an operationalised scenario you need to think about the right re-training strategy for your use-case, always striking a balance between the performance gains from fine-tuning and its cost. The easiest way to automate this is through a scheduled or data-driven execution, leveraging the native integration of SageMaker Pipelines with EventBridge.
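As a hedged sketch of the scheduled variant (rule name, schedule, ARNs and role are all placeholders), an EventBridge rule can target the pipeline directly:

import boto3

events = boto3.client("events")

# Run the retraining pipeline once a week (placeholder schedule)
events.put_rule(
    Name="weekly-sentiment-retraining",
    ScheduleExpression="rate(7 days)",
    State="ENABLED",
)
events.put_targets(
    Rule="weekly-sentiment-retraining",
    Targets=[{
        "Id": "sentiment-pipeline",
        "Arn": "arn:aws:sagemaker:eu-west-1:123456789012:pipeline/hf-sentiment-pipeline",
        "RoleArn": "arn:aws:iam::123456789012:role/EventBridgeStartPipelineRole",  # role allowed to start the pipeline
        "SageMakerPipelineParameters": {"PipelineParameterList": []},
    }],
)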

Note that the model will automatically pick up the newly annotated data to fine-tune on. This is because, when we initially created the pipeline, we set the input of the processing job to be the same S3 location that we set as the output of the human review task.

When implementing this for your own use-case, you might be annotating a different type of data, in which case the output of the human review process might have a different JSON structure. If so, you'll also need to amend the preprocessing file so that it properly parses your annotation files.
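For reference, a rough sketch of parsing an A2I output file inside the preprocessing script might look like the following; the exact keys inside answerContent depend on the worker template you chose, so treat the sentiment field name as an assumption.

import json

def parse_a2i_output(output_json_path):
    # Each completed human loop writes an output.json containing the original
    # input we sent to start_human_loop plus the reviewers' answers
    with open(output_json_path) as f:
        record = json.load(f)

    sentence = record["inputContent"]["taskObject"]
    # Assumed field name from the crowd-classifier template; adjust to yours
    label = record["humanAnswers"][0]["answerContent"]["sentiment"]["label"]
    return sentence, label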

Back to our earlier example: if you now run the same sentence through the model again, you will see it predict more confidently that this is a positive sentence.

>>> make_prediction("I bet I now understand how SageMaker works!")
('Positive', 0.7190687847137451)

The fine-tuning worked!

Conclusion

Image generated with Stable Diffusion model with prompt: “A human reviewing predictions of a machine learning model”

In this post, we went through an end-to-end code example of how to incorporate a human reviewer into an ML workflow, and we saw in action how we can continuously improve our model. With such a workflow you can really leverage the power of fine-tuning these large language models for your use-case and keep improving them as you gather more examples from real-life usage of the models.

I hope this gave you ideas of how to keep improving your models and making them more and more valuable for your organisation!

Reach out to me if you want to discuss your use-case and how you can bring your own workloads on the cloud!

If you liked this post, do let me know in the comments below, follow me, and feel free to suggest further topics that you would be interested to read about.
