Data Science MSc projects with Gousto — Part I

Steven George
Gousto Engineering & Data
11 min read · Nov 3, 2022
Left: Sagar Kishore. Right: Ishaq Gunawan.

This summer Gousto’s Data Scientists once again collaborated with Masters students undertaking courses in Data Science and Advanced Analytics to work on company-specific projects. This programme has grown from 7 students last year to 17 students this year and we plan to continue this trend going forward!

In this series of blog posts we chat to some of the students we worked with in recent months to find out about their background and the exciting projects they worked on with Gousto.

Please introduce yourself and tell us about the course you are studying

Sagar: Hello, my name is Sagar Kishore. I have two years of experience as a Machine Learning Engineer. I wanted a better understanding of the models I was deploying, so I joined the Masters in Data Science program at Lancaster University. It is a one-year program that primarily focuses on the fundamentals of machine-learning algorithms and offers numerous pathways that explore more advanced techniques in Data Science. I opted for the Computing pathway, which consisted of Big Data Systems, Distributed Artificial Intelligence, and Applied Data Mining modules. It has been a challenging yet satisfying experience overall.

Ishaq: Hello, I’m Ishaq Gunawan and I am a recent graduate of the MSc Data Science programme at Lancaster University. Before the MSc, I completed a BEng in Computer Science and Electronics at the University of Bristol, where I found my passion for machine learning and data science after taking a machine learning unit in my second year. That led me to pursue a Masters degree in data science, which is the programme I have recently completed. What attracted me to this MSc in particular is that the course covers a wide range of key AI and machine learning topics, such as computer vision, natural language processing (NLP), and reinforcement learning.

Because it’s the Gousto blog, we must ask: what is your favourite food?

Sagar: As an Indian and a lacto-ovo-vegetarian, I have a plethora of mouthwatering options to choose from, but my favourite has to be my mom’s Kadai Paneer, which I used to eat every Sunday. It is a curry made with a rich, tangy and spicy tomato puree and contains chunks of peppers, onions and paneer tossed in powdered Indian spices. It is the perfect combination of flavours I love, and I have missed it every week for the past year.

Ishaq: I am a huge fan of Thai cuisine because all of the dishes have such a strong and robust flavour profile. In particular, I absolutely love a good hearty bowl of tom yum goong at the end of a long day. A hot, spicy, and sour dish is just what I need to rejuvenate my weary body.

What first got you interested in data science?

Sagar: My first exposure to Data Science was back in 2016 through a friend who recommended Andrew Ng’s Machine Learning course on Coursera. Before this, I had a basic understanding of data science and machine learning. This course allowed me to understand how powerful these techniques were and their potential to solve problems in a variety of domains. However, I truly fell in love with this domain during my work as a Machine Learning Engineer at Scientist Technologies, where I got exposure to Deep Learning techniques applied to Computer Vision and Natural Language Processing subfields.

Ishaq: During my BEng degree in Bristol, I attended an on-campus talk by LV=GI on how data science is being deployed in the insurance industry to combat insurance fraud and to produce more accurate risk assessments. The talk made me realise the breadth of influence that data science and machine learning have on various commercial industries, and it influenced my decision to steer my multidisciplinary degree in computer science and electronics towards a data science and machine learning specialisation in my final year.

In my final year, I took part in a major data science group project in collaboration with the Alan Turing Institute, which involved developing a machine learning model that could predict the food insecurity level of a given household and determine the factors that affect food insecurity from a household survey data set. We wrote a Medium article about the project outlining our methodologies and approaches. Please check it out :). The successful completion of this group project further affirmed my passion for data science and led me to want to pursue a full-time career in it.

What was the problem your project with Gousto was aiming to solve and what different techniques did you get to try?

Sagar: Recipes are the core product at Gousto, and as a data-first company, Gousto has curated a comprehensive recipe dataset. This dataset includes the fundamental features such as ingredients and steps, but also a handful of recipe attributes such as the cuisine type, the main protein and carbohydrate of the recipe, and dietary claims, among others. Crucial data-driven processes, such as the forecasting algorithm, recommendation system, and menu planning algorithm consume recipe data. Due to the heterogeneous nature of these downstream tasks, we require a robust representation of recipe data.

My project explored various techniques to extract text, image and multimodal embeddings from the recipe name, set of ingredients and recipe steps, and final recipe image. To determine the robustness of these embedding representations, they were evaluated on two sets of tasks — predicting recipe attributes such as protein category and cuisine type (classification) and predicting the popularity of a recipe (learning to rank).

With regard to the techniques, for text embeddings I used Sentence BERT-based models, and for image embeddings I used a ResNet50 model pre-trained on the ImageNet dataset. For classification tasks I used a LightGBM classifier, and for learning to rank I used an XGBoost ranker.
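As an editorial aside, the idea of judging an embedding by how well a simple downstream model can predict recipe attributes from it can be sketched as follows. This is a minimal illustration, not the project's code: the three-dimensional vectors and cuisine labels are made up, and a nearest-centroid classifier stands in for the LightGBM classifier mentioned above so that the example needs only the Python standard library.

```python
import math

def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def fit_centroids(embeddings, labels):
    """Compute one mean embedding per label."""
    by_label = {}
    for vec, label in zip(embeddings, labels):
        by_label.setdefault(label, []).append(vec)
    return {label: centroid(vecs) for label, vecs in by_label.items()}

def predict(centroids, vec):
    """Return the label whose centroid is most similar to vec."""
    return max(centroids, key=lambda label: cosine(centroids[label], vec))

# Hypothetical recipe embeddings and cuisine labels (illustration only).
train_embeddings = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.1, 0.9, 0.2], [0.0, 0.8, 0.3]]
train_labels = ["indian", "indian", "thai", "thai"]
test_embeddings = [[0.85, 0.15, 0.05], [0.05, 0.95, 0.1]]
test_labels = ["indian", "thai"]

centroids = fit_centroids(train_embeddings, train_labels)
predictions = [predict(centroids, vec) for vec in test_embeddings]
accuracy = sum(p == t for p, t in zip(predictions, test_labels)) / len(test_labels)
print(predictions, accuracy)
```

The better an embedding separates recipes by attribute, the higher a simple probe like this should score, which is one way to compare text, image and multimodal representations on the same task.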

Ishaq: Customer-relationship marketing has become a crucial business strategy for establishing a loyal customer base, which is a key factor in the commercial success of companies operating in the business-to-consumer sector, such as Gousto. For this reason, Gousto endeavours to deliver proactive, industry-defining customer care and retention services that give customers a highly personalised and convenient experience.

Currently, Gousto’s customer care agents have to manually tag each customer complaint that comes through their email communication channel. This motivated the project, “Automatic Tagging Of Customer Complaints Using Natural Language Processing”, whose primary aim is to develop an end-to-end NLP-based data science pipeline with a text classification model that can classify new incoming email complaint contacts automatically and accurately. This model will help Gousto’s customer care agents focus on solving the problems outlined in the email complaint contacts themselves instead of investing most of their time and effort in tedious administrative tasks.

During the project, I had the opportunity to use different feature extraction methods for textual data, including word embedding models such as GloVe and word2vec, contextualised models such as BERT, and more conventional Bag-of-Words (BOW) approaches such as TF-IDF. For the text classification task itself, I experimented with dense and convolutional neural network-based text classifiers built with Keras, in addition to standard machine learning algorithms commonly used for text classification, such as logistic regression, XGBoost, and Support Vector Machines (SVMs).
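For readers unfamiliar with TF-IDF, here is a minimal sketch of the textbook weighting (term frequency multiplied by the log of inverse document frequency). The three example emails are invented for illustration, the tokeniser is a plain whitespace split, and library implementations typically add smoothing and normalisation on top of this, so treat it as an assumption-laden illustration rather than the pipeline used in the project.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    weights = []
    for tokens in tokenised:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical complaint emails (illustration only).
emails = [
    "my box arrived late",
    "my box was missing chicken",
    "the chicken was damaged",
]
vectors = tf_idf(emails)
print(vectors[0])
```

In the first email, "late" receives a higher weight than "my" because it appears in only one of the three emails, which is the property that makes TF-IDF features useful for telling complaint types apart.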

How did you work with the team at Gousto?

Sagar: I primarily interacted with Steven George on this project. We had weekly/bi-weekly meetings where I would showcase my progress, discuss the next steps and clear any doubts regarding the project. Apart from this, I was added to a Slack channel with Steven, where I could ask questions at any point.

Steven’s insights were crucial to the success of this project. He offered precise feedback and possible next steps to explore that helped me navigate the challenges of training deep learning networks.

Ishaq: I had the pleasure of working with Rita Figueiredo, a Senior Data Scientist within Gousto’s Data team, on this project. Throughout the project we held regular weekly meetings to go over the progress I had made towards the aims and objectives of the project, to discuss any issues I had faced, and to formulate solutions to them. I very much enjoyed my meetings with Rita because we would bounce ideas back and forth, and I always left with ideas and inspiration for new approaches and methodologies to try in my tasks.

Furthermore, Rita also kindly arranged some sessions for me to interact with Gousto’s customer care team directly so that I could obtain additional domain knowledge and context to aid me in the project.

What were some of the challenges you faced along the way, and are there any tips you would give to other data scientists starting their first project with a company?

Sagar: The major challenge I faced during this project was the implementation and training of the multimodal network. Through my literature review, I discovered numerous research papers that proposed multimodal networks for producing multimodal embeddings of recipe data. For my project, I opted for an architecture that used a Sentence BERT model for text inputs and a pre-trained ResNet50 model for image input, and trained and aligned them via a cosine similarity loss function. The network and training loop are written in PyTorch, and since I wasn’t familiar with it at the start of this project, I often read the codebases of the research papers to gain familiarity.

The multimodal network also consisted of over 130 million parameters, and I had to train it on my MacBook Air M1. With each epoch taking an hour to train, it required a lot of patience and trust that the model’s performance would keep improving. The final model was trained for 44 epochs and generated robust representations of recipe data, thus achieving our primary goal.
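To give a flavour of what aligning two encoders with a cosine similarity loss means, here is a small standard-library sketch. It is not the PyTorch code from the project: the two-dimensional vectors are invented, and the loss shown (the mean of one minus the cosine similarity over matched text/image pairs) is only one simple form such a loss can take. Whether the actual network also used mismatched pairs or a margin is not stated above.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def alignment_loss(text_embeddings, image_embeddings):
    """Mean of (1 - cosine similarity) over matched text/image pairs."""
    losses = [
        1.0 - cosine_similarity(text, image)
        for text, image in zip(text_embeddings, image_embeddings)
    ]
    return sum(losses) / len(losses)

# Hypothetical embeddings for two recipes (illustration only).
text_batch = [[1.0, 0.0], [0.0, 2.0]]
aligned_images = [[2.0, 0.0], [0.0, 1.0]]     # same directions as the text
misaligned_images = [[0.0, 1.0], [3.0, 0.0]]  # orthogonal to the text

loss_aligned = alignment_loss(text_batch, aligned_images)
loss_misaligned = alignment_loss(text_batch, misaligned_images)
print(loss_aligned, loss_misaligned)
```

Minimising a loss of this shape pushes the text and image embeddings of the same recipe to point in the same direction, regardless of their magnitudes.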

The one tip I would give someone starting their first project would be to assess the different parts of the project in terms of familiarity. I highly recommend working on the unfamiliar but crucial aspects of the project as soon as possible so that you are not left with only a few weeks to crack something you have never faced before.

Ishaq: The main challenge I faced during the project concerned the quality of the set of labels (i.e. complaint tags) provided by Gousto and the way the training email complaint contacts had been annotated, which left the label set sparsely defined. Furthermore, some of the individual labels were ambiguously defined, and a significant number of them were semantically similar to each other (i.e. they referred to more or less the same type of complaint). As a result, the text classification models had a difficult time using these labels to learn to accurately classify email complaint contacts.

To address this issue, I had to develop a method, based on unsupervised topic modelling, to automatically define a new, more meaningful and semantically distinct set of complaint labels and to annotate the training data with them. I experimented with a variety of topic modelling methods, including ones that leverage embedding models (e.g. BERT and doc2vec) and ones that use BOW representations, such as Latent Dirichlet Allocation (LDA). In the end, using LDA, I was able to automatically define a new set of complaint labels that are semantically meaningful and distinct from each other. Through LDA I was also able to automatically and accurately annotate (i.e. label) the training data to facilitate the learning process of the supervised text classification models.
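As a rough illustration of how topic modelling can turn unlabelled text into labels, the sketch below fits a tiny LDA model with collapsed Gibbs sampling and then labels each document with its dominant topic. Everything here is an assumption made for the example: the tokenised complaints are invented, the hyperparameters (`alpha`, `beta`, `n_iters`) are arbitrary, and the project itself would have used an established library rather than a hand-written sampler.

```python
import random

def lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; returns per-document topic proportions."""
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    word_id = {word: i for i, word in enumerate(vocab)}
    n_vocab = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]            # topic counts per document
    topic_word = [[0] * n_vocab for _ in range(n_topics)]  # word counts per topic
    topic_total = [0] * n_topics                           # total words per topic
    assignments = []
    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        doc_assign = []
        for word in doc:
            k = rng.randrange(n_topics)
            doc_assign.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[word]] += 1
            topic_total[k] += 1
        assignments.append(doc_assign)
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for pos, word in enumerate(doc):
                v = word_id[word]
                k = assignments[d][pos]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + n_vocab * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][pos] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1
    return [
        [(doc_topic[d][t] + alpha) / (len(doc) + n_topics * alpha) for t in range(n_topics)]
        for d, doc in enumerate(docs)
    ]

# Hypothetical tokenised complaints (illustration only).
docs = [
    ["box", "late", "delivery", "late"],
    ["delivery", "late", "courier"],
    ["missing", "chicken", "ingredient"],
    ["ingredient", "missing", "box", "missing"],
]
theta = lda_gibbs(docs, n_topics=2)
# Use each document's dominant topic as its automatically assigned label.
labels = [max(range(len(row)), key=row.__getitem__) for row in theta]
print(labels)
```

The topic indices themselves carry no meaning. In practice someone would inspect the top words of each topic to give it a human-readable complaint label.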

My advice to anyone undertaking their first data science project with a company is, first and foremost, to understand how the outcome or product of the project will affect the company. It is crucial to familiarise yourself with the potential impact of your work. Once you understand that impact, you will have a clearer path towards completing the project with all aims and objectives achieved.

Another piece of advice I would give is, when facing a difficult issue during the project, to try to think outside the box when coming up with solutions, drawing inspiration from your prior knowledge and experience as well as from research into possible approaches. That is how I managed to solve the issue I encountered with the provided set of labels.

What was your favourite thing about working on this project?

Sagar: I thoroughly enjoyed every aspect of this project, and the fact that I applied cutting-edge Deep Learning techniques to a rich food dataset to solve a novel and complex problem has to be my favourite. I would also attribute my enjoyment to working under my supervisors Steven George from Gousto and Dr Leandro Soriano Marcolino from Lancaster University.

Ishaq: What I love about the project are the challenges I encountered while working on it, which forced me to experiment with and explore new ideas and methodologies that I had not touched on before. The work I have done here builds nicely upon the data science and NLP knowledge and experience I accumulated during the MSc programme and has served as an excellent experience for me in dealing with real-world, large-scale, and unclean textual data.

Overall, I am happy that the end-to-end pipeline I developed has not only produced a text classification model that can automatically and accurately (with a weighted F1-score of 0.85) classify incoming email complaint contacts, but has also automated the training of the text classifier by using unsupervised topic modelling to automatically define a set of labels that is highly representative of the training data and to annotate the data accordingly.
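For context on the metric quoted above, a weighted F1-score averages the per-class F1 scores, weighting each class by how many true examples it has. The sketch below computes it from scratch on a handful of made-up labels. The complaint categories and predictions are hypothetical and have no connection to the 0.85 figure reported for the project.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        # Weight each class by its share of the true labels.
        score += (n_true / total) * f1
    return score

# Hypothetical complaint tags (illustration only).
y_true = ["late", "late", "late", "missing", "missing", "damaged"]
y_pred = ["late", "late", "missing", "missing", "missing", "late"]
score = weighted_f1(y_true, y_pred)
print(round(score, 3))
```

Because of the weighting, a class with few examples (here "damaged") pulls the score down less than a frequent one, which is worth keeping in mind when complaint categories are imbalanced.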

What skills did you gain during the project and how will these help with your future endeavours?

Sagar: Through this project, I have gained a solid understanding of how Transformer architecture-based models work and exposure to the world of Multimodal Machine Learning. I can confidently say that I can use PyTorch and that I have built and trained a complex Multimodal network from scratch. I would like to thank Gousto for providing me with the opportunity to work on this project. I relished the unparalleled complexity and novelty that this project offered.

Ishaq: The most crucial takeaway from this project with Gousto’s data team is, for me, the experience of developing a complete end-to-end NLP-based data science pipeline for text classification, during which I experimented with and used state-of-the-art word embedding models at various stages of the pipeline. Furthermore, I had the opportunity to implement various deep learning-based models using Keras to classify the email complaint contacts. I am confident that the data science and NLP experience I have gained will equip me with the knowledge and skills I need to take on, and do well in, the NLP projects I may be assigned in a future full-time NLP data scientist role.

Thank you for reading and keep a lookout for Part II of this series! In the meantime why not check out our series comparing Data professions and give us a follow here to stay up-to-date with our blog posts.
