Using GPT to Assess the Quality of Open-Ended Responses: The Good and the Very Good!
As someone who spent more than 250 hours (six weeks!) cleaning data last year, I am very motivated to find a solution to the data quality challenges that we face in our industry.
Despite trying various tools specifically designed to evaluate the quality of open-ended responses, I have yet to come across one that can effectively analyze the content of a response beyond what is obviously bad (e.g. gibberish or profanity). Given how well ChatGPT understands context, I had high expectations that it would be able to efficiently clean open-ended responses.
Because I am a woman of science, I had to put this to the test, and it worked beautifully!
Background
With the help of a professional techie, it took about 4 hours to set up a tool within Google Sheets. The tool uses the new Chat Completions API and gpt-3.5-turbo model from OpenAI. This new API works well for this problem for a few reasons:
- First, the API provides a much clearer way to set the “system” context, which tells GPT what it should act as. In my case, we set the system context as: “You are a helpful survey research assistant here to help me detect fake survey responses to the question: <full question here>. Please always start your answers with ‘Yes’ if you think the response is real and ‘No’ if you think it’s fake.”
- Second, the API has a built-in way to supply example requests and responses right within the context of that single call, effectively “training” the model on the spot. This means there is no need to create a fine-tuned model ahead of time, which is huge (a sketch of this message structure follows this list).
- Third, the new model is more powerful than the GPT-3 Davinci API that I had used before, but costs a tenth as much, so one can give a lot more context to each request at very little cost.
- Last but not least, as the gpt-3.5-turbo name suggests, the model is very fast, so every prompt to check a response takes only a second or two!
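To make this concrete, here is a minimal sketch of the kind of message list the Chat Completions API expects, assuming a placeholder question and made-up example answers; the exact wording of our prompt and training examples differed.

```
// Illustrative message list for the Chat Completions API: a "system" message sets the
// role, and a few user/assistant pairs act as in-context training examples.
// The question text and example answers below are placeholders, not our real data.
var messages = [
  {
    role: 'system',
    content: 'You are a helpful survey research assistant here to help me detect ' +
             'fake survey responses to the question: <full question here>. ' +
             'Please always start your answers with "Yes" if you think the response ' +
             'is real and "No" if you think it\'s fake.'
  },
  // Example 1: a specific, on-topic answer we would accept.
  { role: 'user', content: 'It saved me a lot of time on my weekly grocery run.' },
  { role: 'assistant', content: 'Yes - the response is specific and relevant to the question.' },
  // Example 2: gibberish we would reject.
  { role: 'user', content: 'asdkjh good good good' },
  { role: 'assistant', content: 'No - the response is gibberish and does not address the question.' },
  // The actual open-ended response to assess is appended as the final user message.
  { role: 'user', content: 'It was fine I guess.' }
];
```

The model reads the example user/assistant pairs as if they were earlier turns of the conversation, which is what makes this in-call “training” possible.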
We used Google Sheets Apps Script to build a custom VERIFY_RESPONSE() formula that calls the OpenAI Chat Completions API and returns a Yes/No assessment along with a reason.
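For illustration, here is a minimal sketch of what such a custom formula can look like, assuming the API key is stored in a script property named OPENAI_API_KEY; this is a simplified stand-in for our actual implementation, not a copy of it.

```
// Sketch of a custom Sheets formula that asks gpt-3.5-turbo whether an open-ended
// response looks real. Assumes an OPENAI_API_KEY script property (illustrative name).
function VERIFY_RESPONSE(question, response) {
  var apiKey = PropertiesService.getScriptProperties().getProperty('OPENAI_API_KEY');
  var payload = {
    model: 'gpt-3.5-turbo',
    messages: [
      {
        role: 'system',
        content: 'You are a helpful survey research assistant here to help me detect ' +
                 'fake survey responses to the question: ' + question + '. ' +
                 'Please always start your answers with "Yes" if you think the response ' +
                 'is real and "No" if you think it\'s fake.'
      },
      // The few-shot user/assistant example pairs (see the sketch above) go here.
      { role: 'user', content: response }
    ],
    temperature: 0
  };
  var options = {
    method: 'post',
    contentType: 'application/json',
    headers: { Authorization: 'Bearer ' + apiKey },
    payload: JSON.stringify(payload),
    muteHttpExceptions: true
  };
  var result = JSON.parse(
    UrlFetchApp.fetch('https://api.openai.com/v1/chat/completions', options).getContentText()
  );
  // Returns something like "Yes - the response is specific and on-topic."
  return result.choices[0].message.content.trim();
}
```

In the sheet itself, the formula can then be used like any other, e.g. =VERIFY_RESPONSE("<full question here>", A2), and dragged down the column of responses.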
The result
Overall, we had 785 open-ended responses that we wanted to assess for quality. We used 23 of them as training data to help the GPT model learn to assess similar open-ended responses more accurately.
Although the tool performed well in assessing the open-ends based on my training, it rejected too many responses that I personally deemed acceptable. By selecting only the best open-ends for training, I had inadvertently created a model whose standards were too high. To fix this, I added shorter and less sophisticated responses to the mix.
After including a wider range of acceptable responses (now 35 OEs), the tool matched my own assessment with a 90% success rate. In other words, 9 times out of 10, GPT and I agreed on whether a response was good or bad. Our main disagreement was around “no comment” and “don’t know” types of responses, which I am sure we could resolve with a little more training.
In terms of time, it took roughly 20 minutes to train the model, whereas reading through these open-ends manually took 5 hours, so it is a huge time saver. The total cost of this experiment came to $6.76.
Conclusion
The potential for using GPT to efficiently clean survey data, particularly open-ended responses, is huge. But there is more: integrating the API directly into the survey script would let us reject participants who provide inadequate open-ended responses in real time, reducing the need for manual screening. Stay tuned for more research in this space.