Building a Takealot customer service chatbot with generative AI
Co-authors: Schroodinger and Ceasor Mufamadi.
Takealot has built its first chatbot using generative AI, enabling a software engineering team (not one of the machine learning teams, though working under machine learning guidance) to harness the power of large language models (LLMs).
The rapid growth of open-source LLMs, as well as those hosted by cloud providers, makes it easier than ever to develop generative AI applications. The advent of LLMs has dramatically changed the landscape for use cases involving unstructured data (e.g. text, images, audio and video). Previously, gathering data and training a model was a complex operation. Now, that time can be spent on testing instead, and iterations on the final product can happen much faster.
Where is my order? I want a refund!
The Takealot chatbot answers customer questions based on existing help centre articles, making it a perfect use case for retrieval augmented generation (RAG). In essence, the knowledge base is split into chunks, and an embedding is computed (using an embedding model) for each chunk. These embeddings are stored in a vector database. When a customer queries the chatbot, the embedding of the query is matched against the chunks in the vector database. The full text of the closest chunks, along with the query, is then provided as input to another (typically bigger and more powerful) LLM, which generates the answer returned to the customer.
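To make the flow concrete, here is a minimal sketch of those steps using chromadb and sentence-transformers. The article snippets, helper names and prompt wording are illustrative, not Takealot's actual code.

```python
import chromadb
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
client = chromadb.Client()
collection = client.create_collection("help_centre")

# Illustrative stand-ins for the real help centre articles.
HELP_ARTICLES = [
    "Refunds are processed within 5-10 business days once a return is approved.",
    "You can track your order under 'My Orders' in your Takealot account.",
]

# 1. Chunk the knowledge base and store an embedding per chunk.
collection.add(
    ids=[str(i) for i in range(len(HELP_ARTICLES))],
    documents=HELP_ARTICLES,
    embeddings=embedder.encode(HELP_ARTICLES).tolist(),
)

# 2. Embed the customer query and find the closest chunks.
def retrieve(query: str, k: int = 2) -> list[str]:
    hits = collection.query(
        query_embeddings=embedder.encode([query]).tolist(),
        n_results=k,
    )
    return hits["documents"][0]

# 3. The retrieved text plus the query becomes the prompt for a larger LLM.
context = "\n".join(retrieve("Where is my order?"))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: Where is my order?"
```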
Guardrails and hallucinations
In a sense, hallucination is all that LLMs do; the hallucinations simply vary from useful to nonsensical. Using RAG is a great first step towards steering them into useful territory. It was also necessary to create guardrails so that the chatbot would not respond to questions outside its help centre knowledge base. A lot of time was spent creating these guardrails and getting the prompt design right. In particular, clear, concise and short prompts proved to be a winning strategy.
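As an illustration, a guardrail prompt in this spirit might look like the sketch below; the exact wording and refusal message of the production prompt are not public, so everything here is an assumption.

```python
# Illustrative guardrail prompt: answer only from retrieved context,
# refuse anything outside the help centre knowledge base.
GUARDRAIL_PROMPT = """You are a Takealot help centre assistant.
Answer ONLY using the context below. If the answer is not in the
context, reply exactly: "I'm sorry, I can only help with questions
covered by the Takealot help centre."
Keep answers short and factual.

Context:
{context}

Question: {question}
"""

def build_prompt(context: str, question: str) -> str:
    # Fill the template with the retrieved chunks and the customer query.
    return GUARDRAIL_PROMPT.format(context=context, question=question)
```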
The challenge of navigating a rapidly evolving landscape
After some back and forth on software packages and open-source models, the project settled on LangChain, with ChromaDB as the vector database and all-MiniLM-L6-v2 to create embeddings for each chunk. GPT4All was used to run Mistral-7B-Instruct, the prevailing open-source model at the time, as the final answering LLM. This model could run on a developer’s laptop, although response times were painfully slow (a few minutes).
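For reference, here is a rough sketch of how these pieces fit together using LangChain's community integrations; the model file name, chunk sizes and input file are placeholders, not the production configuration.

```python
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import GPT4All
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Chunk the help centre articles and embed them with all-MiniLM-L6-v2.
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_text(open("help_centre.txt").read())
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectordb = Chroma.from_texts(chunks, embeddings)

# Mistral-7B-Instruct via GPT4All as the answering LLM (local model file).
llm = GPT4All(model="mistral-7b-instruct-v0.1.Q4_0.gguf")

# Retrieval + generation glued together as a question-answering chain.
qa = RetrievalQA.from_chain_type(llm=llm, retriever=vectordb.as_retriever())
print(qa.invoke({"query": "Where is my order?"}))
```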
To serve customer traffic, we opted to self-host the Mistral model on a GPU on Vertex AI, at roughly $400/month for a single instance. Just a few weeks before the go-live date, Gemini Flash was announced, which dramatically altered the playing field: its cost was so low that self-hosting became pointless. We switched to Gemini Flash, which was faster, cheaper and produced better responses than the Mistral model given the same prompts.
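Swapping the self-hosted model for Gemini Flash is then a small change on the generation side. A minimal sketch with the Vertex AI SDK, reusing the build_prompt helper sketched earlier; the project ID, region and model version are placeholders.

```python
import vertexai
from vertexai.generative_models import GenerativeModel

# Placeholders: use your own GCP project and region.
vertexai.init(project="your-gcp-project", location="europe-west1")
model = GenerativeModel("gemini-1.5-flash-001")

# Same RAG recipe as before: retrieved context plus the customer query.
prompt = build_prompt(context="<retrieved help centre chunks>",
                      question="Where is my order?")
response = model.generate_content(prompt)
print(response.text)
```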
Conclusion
The underlying goal, enabling software engineering teams to develop generative AI applications, was achieved.
This opens up a lot of potential in the immediate future for the software engineering teams at Takealot. The generative AI landscape (in particular the APIs) also seems to be stabilising around Vertex AI, making it easier to make long-term decisions about which technology to use.