Data Collection Strategies for Building an AI Chatbot

Published in IBM Data Science in Practice

5 min read · Sep 20, 2019

As chatbots continue to be in high demand among companies that want to automate customer service or offer it 24x7, many still struggle to understand what is necessary to get a chat solution off the ground. Simply put, AI needs data. However, not just any data will do.

When you build a chatbot, you must train it to understand the questions or requests that are most likely to be posed to the solution. This is the most common misunderstanding I encounter. Many organizations have plenty of “data” in the form of answers that they wish to provide within a chatbot. However, they often lack the most crucial data needed for training a chatbot: examples of how users express their goals and needs (intents).

I’ve seen more than one project get delayed due to a lack of training data.

AI chatbots are trained using inputs and are then configured to provide outputs (answers or responses). So, the training data must consist of examples (a.k.a. utterances) of users asking questions or making requests. These utterances are used to train a machine learning model. Once the model is trained, it should be able to classify the intent of a request, even if the wording isn’t exactly like the examples it has already seen. This is the true power of AI.
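To make the idea of training on utterances concrete, here is a minimal sketch of an intent classifier. It is a toy bag-of-words model built with the Python standard library, not how any particular chatbot platform works; the intent names and example utterances are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical training utterances, grouped by intent label.
TRAINING = {
    "check_balance": [
        "how much money is in my account",
        "what's my balance",
        "show me my current balance",
    ],
    "reset_password": [
        "i forgot my password",
        "can't log in, need to reset my password",
        "how do i change my password",
    ],
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(examples):
    """Build one word-count vector per intent from its example utterances."""
    model = defaultdict(Counter)
    for intent, utterances in examples.items():
        for utterance in utterances:
            model[intent].update(tokenize(utterance))
    return model


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def classify(model, utterance):
    """Return the intent whose training vocabulary is most similar."""
    vector = Counter(tokenize(utterance))
    return max(model, key=lambda intent: cosine(vector, model[intent]))


model = train(TRAINING)
# This wording does not appear verbatim in the training examples.
print(classify(model, "what is the balance on my account"))
```

The request in the last line is worded differently from every training example, yet it shares enough vocabulary with the balance examples to be matched to that intent. A production model generalizes far better than word overlap, but it depends on the same ingredient: real examples of how users phrase each request.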

It is also critical to have representative training examples, which means:

Customer Voice

Data must be collected from the same type of end users targeted for the solution (NOT subject matter experts, NOT developers, NOT executives). The questions must be expressed in the voice of the user, using their vocabulary and phrasing.

When data is collected from SMEs, developers, and other people close to the project, it often carries bias in its terminology. These contributors submit questions with an idea in mind about the expected response, and they often lack the background and real-world circumstances that drive users to engage with a chatbot solution.

Channel Compatibility

Data must be collected from the SAME channel as the deployed solution (a text message solution must collect training data from text messages, a webchat solution must collect training data from chat logs, an IVR solution must collect training data from a voice channel).

Data from dissimilar channels is generally not suitable. The primary reason is that people express themselves very differently when they speak, compared to how they type. (And people express themselves differently over email vs. text message, etc.)
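The two criteria so far (customer voice and channel compatibility) can be applied as a simple filter over candidate utterances. The sketch below assumes each candidate has already been tagged with who wrote it and where it came from; the `Utterance` fields and the role and channel labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    text: str
    author_role: str  # hypothetical labels: "end_user", "sme", "developer"
    channel: str      # hypothetical labels: "webchat", "sms", "ivr", "email"


def select_training_candidates(utterances, target_channel):
    """Keep only end-user utterances collected on the deployment channel."""
    return [
        u
        for u in utterances
        if u.author_role == "end_user" and u.channel == target_channel
    ]


candidates = [
    Utterance("hey can u tell me my balance", "end_user", "webchat"),
    Utterance("Please provide the account balance inquiry flow", "sme", "webchat"),
    Utterance("I would like to inquire about my balance.", "end_user", "email"),
]

for u in select_training_candidates(candidates, target_channel="webchat"):
    print(u.text)
```

Only the first utterance survives for a webchat deployment: the second was written by an SME, and the third came from email, where people tend to phrase things more formally.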

Outside Factors

Be aware of how seasonal events can influence the scope and frequency of certain topics. If your candidate training examples are harvested from a particular point in time, some topics may be under-represented or over-represented.

Do outside factors such as holidays, tax season, open enrollments, year-end processing, etc. impact the kinds of questions your users may have?

I once worked with a banking client to build a chatbot that answered general account questions. The company offered an assistance program for loan customers who may have been impacted by hurricanes. This was a high-volume request, but only during certain times of the year. If our source chat transcripts had not covered a broad enough time frame, we might have missed this topic altogether.
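One way to check for this kind of gap is to look at which calendar months your source transcripts cover and when each topic shows up. The sketch below assumes each transcript has already been labeled with a topic; the dates and topic names are made up for illustration.

```python
from collections import defaultdict
from datetime import date


def months_by_topic(transcripts):
    """Map each topic to the set of calendar months (1-12) in which it appears."""
    seen = defaultdict(set)
    for day, topic in transcripts:
        seen[topic].add(day.month)
    return dict(seen)


def uncovered_months(transcripts):
    """List the calendar months with no transcripts at all."""
    observed = {day.month for day, _ in transcripts}
    return sorted(set(range(1, 13)) - observed)


# Hypothetical (date, topic) pairs taken from chat transcripts.
transcripts = [
    (date(2018, 1, 15), "check_balance"),
    (date(2018, 2, 3), "check_balance"),
    (date(2018, 3, 22), "loan_payment"),
    (date(2018, 9, 12), "hurricane_assistance"),
    (date(2018, 9, 14), "hurricane_assistance"),
]

print(uncovered_months(transcripts))
print(months_by_topic(transcripts)["hurricane_assistance"])
```

If the sample had stopped in March, the hurricane assistance topic would not have appeared at all. A long list of uncovered months is a signal to pull transcripts from a wider time frame before deciding which topics the bot needs to handle.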

So, How Can I Get Started?

Obtaining enough training data can be a challenge, especially in the early phases of chatbot design. There are essentially five options you can choose from. These are listed from most preferable to least preferable.

The more preferable approaches (1 & 2) will give you better predictions about how the chatbot will perform once it is launched. They will also help you identify the best areas (i.e. the most frequent topics) to invest your efforts, which can guide you in building a solution that delivers the most business value.
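When you do have real user utterances, finding the most frequent topics can be as simple as counting. The sketch below ranks topics by volume and shows the cumulative share of requests covered as you work down the list; the labeled utterances are hypothetical and assume someone has already assigned a topic to each one.

```python
from collections import Counter


def rank_topics(labeled_utterances):
    """Return (topic, count, cumulative_share) tuples, most frequent first."""
    counts = Counter(topic for _, topic in labeled_utterances)
    total = sum(counts.values())
    ranked = []
    cumulative = 0
    for topic, n in counts.most_common():
        cumulative += n
        ranked.append((topic, n, cumulative / total))
    return ranked


# Hypothetical (utterance, topic) pairs.
labeled = (
    [("what's my balance", "check_balance")] * 5
    + [("i forgot my password", "reset_password")] * 3
    + [("when do you close", "branch_hours")] * 2
)

for topic, n, share in rank_topics(labeled):
    print(f"{topic}: {n} requests, {share:.0%} cumulative")
```

In this made-up sample, handling the top two topics would cover 80% of requests, which is the kind of estimate that helps decide where to invest effort first.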

Without appropriate planning, the less preferable approaches (4 & 5) often result in unpredictable or poor performance. Sometimes these options are unavoidable, so read the caveats and be prepared for some immediate improvement phases. A bot that is not equipped to handle the range of requests it encounters from real-world users usually does not deliver business value (and it really frustrates users).