Improving Customer Support with a Machine Learning-based Receptionist

Alan Godoy
Blog Técnico QuintoAndar
9 min read · Dec 6, 2021
Figure 1: Connecting customers to departments and agents at QuintoAndar.

As part of a real estate company, customer support at QuintoAndar has to deal with hundreds of different types of contacts involving individuals as distinct as tenants (and prospective tenants), landlords, agents (realtors, inspectors, photographers, etc) and building administrators. Issues range from simple questions about using our site to search for a house to requests for intermediation of complex tenant-owner issues like negotiating a temporary rent reduction.

This means that matching a user to the right specialist is not an easy task. On the one hand, presenting a menu with a multitude of customer support departments may be time-consuming and ineffective, as users may not be familiar with each department’s responsibilities and are likely to make mistakes when choosing an option. On the other hand, letting users freely declare what they need has its own drawbacks: the user’s explanation may lack sufficient context about their demand, and that explanation still has to be converted into a decision about the destination department of the contact.

For WhatsApp — our main support channel — this issue gets trickier, though: the app is ubiquitously used in Brazil as a means of communication with family and friends. Contacts, thus, may start with a simple “Hi!”, with a complete account of the user’s issue or with a partial description that refers to past contacts.

To improve ticket-routing accuracy and support speed, and to automate the collection of all necessary information before a human analyst is assigned to the ticket, our team developed a chatbot system that acts as a receptionist for customer support. This service brought positive results to QuintoAndar, allowing the full automation of chat triage.

Beyond presenting our chatbot architecture, our goal is to show the results achieved by state-of-the-art models in a real-world application. Much has been discussed about the fantastic results achieved by recent Natural Language Processing (NLP) models for the English language both in academic literature and in industry. How this transfers to Brazilian Portuguese is not as clear, though. By sharing our experience, we hope to shed some light on how Portuguese-speaking companies and research groups may use these technologies outside common benchmark problems.

A Look into the Past

Prior to the receptionist bot, triage and department selection were carried out by a team dedicated solely to verifying that sufficient context was provided, asking users for more information when necessary, and directing each chat to the appropriate department. This was, of course, far from ideal. A large number of skilled agents were diverted from customer troubleshooting to a highly repetitive task that added latency to proper chat routing and made any process change affecting departmental responsibilities harder, since each such change required retraining the triage team.

Creating an Architecture for a Receptionist

To sustain the company’s scalability, our team developed a “receptionist” chatbot that should be able to:

  1. Check for context: Determine whether the user has provided enough information in their initial message, requesting a better description if deemed necessary.
  2. Predict contact reason: Combine this textual information with data about the user’s relationship with QuintoAndar to predict the reason for the current contact.
  3. Gather data: Automate the collection of basic data, such as information to identify users whose phone number is not recognized, and data that helps the agent verify the user’s authenticity.
  4. Ticket routing: Use business rules together with the predicted contact reason to determine to which department the chat should be directed.
  5. Assist the agent: Provide the selected agent with the predicted reason for the contact, facilitating access to the appropriate guidelines for that service and speeding up the mandatory annotation after the conversation is finished.
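
These five responsibilities can be pictured as a linear pipeline. The sketch below is purely illustrative (all function names, reasons and routing rules are invented, not QuintoAndar’s production code):

```python
# Illustrative sketch of the receptionist's responsibilities as a pipeline.
# All names, reasons and routing rules are hypothetical.

def has_enough_context(message: str) -> bool:
    # Stand-in for the context evaluation model: here, a trivial
    # length heuristic instead of a trained classifier.
    return len(message.split()) >= 4

def predict_contact_reason(message: str, user_profile: dict) -> str:
    # Stand-in for the contact reason model combining text and tabular data.
    if "visit" in message.lower() and user_profile.get("is_agent"):
        return "agent_reschedule_visit"
    return "general_question"

def route_ticket(reason: str) -> str:
    # Business rules mapping a predicted reason to a department.
    rules = {"agent_reschedule_visit": "visits",
             "general_question": "general_support"}
    return rules.get(reason, "human_triage")

def receive(message: str, user_profile: dict) -> dict:
    # Check context, predict the reason, route, and expose the
    # prediction to the receiving agent.
    if not has_enough_context(message):
        return {"action": "ask_for_details"}
    reason = predict_contact_reason(message, user_profile)
    return {"action": "transfer",
            "department": route_ticket(reason),
            "predicted_reason": reason}
```

In the real system, the first two functions are backed by the machine learning models described later in this post.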

This was not without its hurdles. First, there were more than 300 standardized possible contact reasons with a substantial gray area between them, which guaranteed a great deal of label noise. Classes were highly unbalanced: some contained only a handful of chats in a given year, while other classes totaled thousands in a single week. Finally, since we operate in Portuguese, there were no open customer-service resources available in the language, so we could rely only on pre-trained models and our own private datasets to develop the chatbot.

Following the dialog-state architecture, we’ve built our bot with the following basic components:

  • Handlers: processors used by the chatbot to extract relevant information from input messages or make decisions about what to do next.
  • Finite-state automaton: a recipe that indicates which handler should be called at each state and how policy handler results are used to select the next state.
  • Dialog memory: a flexible memory that stores all relevant information produced by handlers during the conversation, including a special field indicating the automaton’s current state.

Handlers, in turn, can be of two different types:

  • Message processing handlers: responsible for extracting information from input messages and storing it on dialog memory.
  • Policy decision handlers: responsible for deciding what the system should do given the information stored in dialog memory. Examples of possible actions are: (i) performing some calculations and storing their results in the memory, (ii) sending a response to the user or (iii) transferring the chat to a certain department.

Each handler has its own implementation, which may range from simple application of business rules to complex ML models and third-party services, as exemplified in the figure below:
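A minimal sketch of this dialog-state loop, with one handler of each type, could look like the following (the handler names, states and greeting logic are invented for illustration, not the production implementation):

```python
# Message processing handler: extracts information from the input
# message and stores it in the dialog memory.
def extract_greeting(message, memory):
    memory["is_greeting"] = message.strip().lower() in {"hi", "hi!", "hello"}

# Policy decision handler: decides what to do next based only on
# the information stored in the dialog memory.
def decide_next(memory):
    return "ask_issue" if memory["is_greeting"] else "classify_issue"

# Finite-state automaton: for each state, run a message handler, then
# use a policy handler to choose the next state, stored in the memory.
def step(message, memory):
    state = memory.get("state", "start")
    if state == "start":
        extract_greeting(message, memory)
        memory["state"] = decide_next(memory)
    return memory["state"]
```

Because handlers only read from and write to the dialog memory, swapping a rule-based handler for an ML model does not change the automaton itself.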

Figure 2: Example of conversational flow. Message information is extracted by message processors, while decisions are made using policy handlers.

Classifying Contact Reason

We use two different machine learning models in this conversational flow:

  • Context evaluation model: used to predict whether or not a particular user message was sufficiently explanatory. For more details about this model, please refer to our full paper.
  • Contact reason prediction model: model that combines the user’s message with tabular information indicating their relationship with QuintoAndar to predict how likely the contact is to be related to each of 306 possible standardized contact reasons (as seen in the figure below).
Figure 3: Examples of contact reasons that our chatbot is able to handle. Each one is dedicated to a type of customer: the first one to brokers, the second to photographers, the third to tenants, and the last is dedicated to house owners. We have 306 possible contact reasons.

The association of text and tabular data is important as a way to avoid requiring users to explain their whole history to the chatbot. Consider the phrase “I need to cancel the visit tomorrow.”. It is a complete yet ambiguous message: is it related to a photographer who wants to reschedule a photo shoot, to a potential tenant who is no longer interested in visiting an apartment, or to a real estate agent who will not be able to present some house and wants to leave it to a colleague? Only with access to the user’s relationship with QuintoAndar can the model accurately predict the contact reason without requesting further context.

Therefore, we used 66 handcrafted features available from our feature store to help set the context of the text messages. Some examples are the type of the last automatic message sent to the user (and time since it was sent), the contact reason of the last ticket (and time since it was created), whether the user is a registered agent, the number of rented houses (as owner) and the number of ended contracts (as tenant). These features are pre-processed by one-hot encoding all categorical columns and scaling the numeric ones to zero mean and unit variance.
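
The tabular pre-processing described above can be sketched as follows (the column names and values are invented for illustration):

```python
import numpy as np

# Toy tabular rows: one categorical and one numeric column.
# Column names and values are illustrative only.
rows = [
    {"last_ticket_reason": "visit", "n_rented_houses": 0},
    {"last_ticket_reason": "contract", "n_rented_houses": 2},
    {"last_ticket_reason": "visit", "n_rented_houses": 4},
]

# One-hot encode the categorical column.
categories = sorted({r["last_ticket_reason"] for r in rows})
one_hot = np.array([[float(r["last_ticket_reason"] == c) for c in categories]
                    for r in rows])

# Scale the numeric column to zero mean and unit variance.
numeric = np.array([r["n_rented_houses"] for r in rows], dtype=float)
scaled = (numeric - numeric.mean()) / numeric.std()

# Final tabular feature matrix: one-hot columns plus scaled numeric column.
features = np.hstack([one_hot, scaled[:, None]])
```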

Figure 4: Combining textual and tabular features for contact reason prediction.

We treat both feature groups (textual and tabular) as different modules that are fed into a downstream classifier. In our first version (V1 — bag-of-words), we used a simple unigram bag-of-words to extract features from messages. With the support of an AutoML tool, we chose a multi-layer perceptron as the classifier that predicts the contact reason. The next table shows the relevant gains obtained by combining textual information with tabular features:

Table 1: Comparing the effect of using tabular data combined with bag-of-words features.
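
The V1 combination step itself is simple: count unigrams over a fixed vocabulary and concatenate the result with the pre-processed tabular vector before classification. A sketch (the vocabulary and feature values are invented):

```python
import numpy as np

# Toy unigram vocabulary; the real one is built from historical chats.
vocabulary = ["cancel", "visit", "contract", "photo"]

def bag_of_words(message):
    # Unigram counts over the fixed vocabulary.
    tokens = message.lower().split()
    return np.array([float(tokens.count(w)) for w in vocabulary])

def build_features(message, tabular):
    # tabular: the already pre-processed numeric vector
    # (one-hot encoded categoricals plus scaled numeric columns).
    return np.concatenate([bag_of_words(message), tabular])

# The resulting vector feeds a single classifier (a multi-layer
# perceptron in V1) that scores the possible contact reasons.
x = build_features("i need to cancel the visit tomorrow",
                   np.array([1.0, 0.0, -0.5]))
```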

After analyzing the first model’s errors, we noticed that, despite its good results, it had trouble with synonyms and complex sentences. In a second version of the model (V2 — BERT) we addressed this issue by extracting textual features with a Portuguese version of BERT fine-tuned on our historical dataset. Using a transformer-based architecture allowed us to better interpret user messages. Even after this improvement, we could still benefit from complementing it with tabular data:

Table 2: Comparing different extraction methods for textual information and effect of using tabular data.

Business Impacts

To assess the results in production, we collected data from the human triage team and from a set of heuristic rules that routed chats based simply on the last automatic message sent to the user (e.g., if the message was related to a visit, the user was directed to the visits department). The business metric we chose for comparison was the transference rate, i.e., the rate at which a chat routed to a given support department is transferred to another department (lower is better). As a secondary metric, to evaluate how fast an agent is able to solve the user’s problem, we also measured how many messages were exchanged per chat.
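
The transference rate can be computed directly from routing logs; a small sketch (the field names are illustrative):

```python
# Transference rate: the fraction of routed chats that end up handled by
# a department other than the one they were first routed to
# (lower is better). Field names are illustrative.

def transference_rate(tickets):
    transferred = [t for t in tickets
                   if t["final_department"] != t["routed_to"]]
    return len(transferred) / len(tickets)

tickets = [
    {"routed_to": "visits", "final_department": "visits"},
    {"routed_to": "visits", "final_department": "contracts"},
    {"routed_to": "contracts", "final_department": "contracts"},
    {"routed_to": "rentals", "final_department": "rentals"},
]
rate = transference_rate(tickets)  # 1 transfer out of 4 routed chats
```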

We ran tests in production comparing human triage and both chatbot versions. As a safety measure to avoid degrading the user experience, we automatically routed only the 80% of tickets with the highest department score, leaving all low-confidence tickets to human triage.
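
This safety mechanism amounts to ranking tickets by the model’s top department score and automating only the most confident fraction. A sketch (ticket IDs and scores are invented):

```python
# Route only the most confident fraction of tickets automatically;
# the rest go to human triage. IDs and scores are illustrative.

def split_by_confidence(tickets, automated_fraction=0.8):
    # tickets: list of (ticket_id, top_department_score) pairs.
    ranked = sorted(tickets, key=lambda t: t[1], reverse=True)
    cutoff = int(len(ranked) * automated_fraction)
    return ranked[:cutoff], ranked[cutoff:]  # (to_bot, to_human)

tickets = [("a", 0.95), ("b", 0.40), ("c", 0.88), ("d", 0.91), ("e", 0.30)]
to_bot, to_human = split_by_confidence(tickets)
```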

Table 3: Comparison of results for automatic triage for 80% of all tickets.

As the results were similar to human performance, we expanded the rollout to 100% of our client base. You can see our results below:

Table 4: Comparison of results for automatic triage of all tickets.

Considering these results, we can say that the BERT embedding combined with tabular data achieved human-level performance. The number of messages exchanged until the conversation ended (avg. messages per ticket) was also substantially smaller for clients routed by the chatbot than for those routed by humans. One hypothesis for this reduction is that, by using tabular features, the model has access to a large amount of information not easily consumed by the human triage team.

These results in a real-world application using business metrics endorse the potential of modern NLP techniques for Brazilian Portuguese. We hope that results like these help other companies to see that it’s now feasible to go beyond the widespread rigid conversational interfaces based on buttons or simple keywords.

Next Steps

This post reflects work developed in 2020. Our architecture and models have evolved a lot since then: for instance, we have experimented with machine reading comprehension models and new conversational designs. We are eager to share these results in future papers and blog posts. Stay tuned to learn more about how we use machine learning and natural language processing at QuintoAndar!

Acknowledgement

This work was written in collaboration with André Barbosa. The presented work was also developed by Marco Vinha, Vitor Cabral, Pedro Amancio and Katy Miranda. We would like to thank Muriel Dias for producing the images in this post.

Learn More

You can find more details regarding our chatbot architecture and models both for context evaluation and for contact reason prediction in our full paper. This paper has been accepted for oral presentation at the Symposium in Information and Human Language Technology (STIL 2021).


Alan Godoy
AI and Complex Systems Researcher, Data Science Manager @ QuintoAndar