
Introduction
In this blog post, I describe my first ever data science project. While there is some technical jargon related to machine learning algorithms, for the most part the focus is on the end-to-end journey and what I learned along the way.
The project was completed as part of a UTS MDSI (Master of Data Science and Innovation) subject called iLab 1, in which students are presented with a range of real-life, data-related problems experienced by actual organisations and companies (called UTS’s industry partners). The students are assigned the role of data science consultants servicing the industry partners as clients.
I chose a client that has a problem with text analysis. I have always been interested in NLP and text mining, and I regarded this project as an opportunity to learn more about these topics.
Problem Statement
The client is a digital platform provider that targets students who require academic help. This is done by connecting the students with an appropriately qualified tutor through the platform. After the session, the student has the option to:
· rate the session from one to five stars, one being the worst and five the best
· leave comments in a free-text field
The goals of the project are to:
· assign topics to student comments
· calculate the sentiment score of the comments
· predict student ratings
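As a rough sketch of what the first goal might look like in code, here is a minimal topic-assignment example using scikit-learn’s LatentDirichletAllocation. The comments, model choice and number of topics are all my own illustrative assumptions, not details from the actual project:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy comments standing in for the real (confidential) student feedback
comments = [
    "Great tutor, explained calculus clearly",
    "Session was too short and the audio kept dropping",
    "Helped me understand my essay structure",
    "Tutor was late and unprepared",
]

# Bag-of-words features, then a two-topic LDA model
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(comments)
lda = LatentDirichletAllocation(n_components=2, random_state=0)

# One row of topic proportions per comment; each row sums to 1
topic_weights = lda.fit_transform(X)
print(topic_weights.shape)  # (4, 2)
```

Each comment can then be assigned the topic with the highest weight in its row.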
Skills Before Starting
I started my degree at the beginning of 2018 (two years ago, at the time of this writing). I had completed three subjects by the time I started the project; the subjects were all done using R, which is the official language of the course. However, I wanted to extend these skills in Python. I had done a few MOOCs (Coursera, DataCamp, Udemy) on machine learning using Python, but had never completed an end-to-end project in Python.
Technical Skills
Before starting the project, I had an understanding of the concepts in the data science methodology, which includes splitting the data into stratified training and testing sets, bias-variance trade-off, hyperparameter tuning, ensemble learning and many more. These were gained in the first three subjects of the course using R. However, I had not gained fluency in using the packages for implementation.
I was comfortable using Python for the more general enabling skills in the workplace, such as interacting with numerous Excel files, automating manual tasks and monitoring database states.
Soft Skills

I had not had any experience as a consultant, nor had I worked on a data science project. I had only solved problems that were narrowly defined within the confines of an assignment. The implication is that there is an implementable solution and the students are supposed to find it using the techniques taught in the course. Such problems are also largely isolated from other factors that are outside the considerations of statistical modelling and machine learning algorithms. These other factors could, and often do, have a huge impact on decision making.
Dealing with an end-to-end, real-life data science problem is a whole different ball game. Communication with the client is an often overlooked skill. Having the business context helps prevent wasting time on unfruitful investigations. Such investigations could be low value due to a misunderstanding of the client’s strategy, or even useless if the client would be unable to implement the findings for reasons such as legislation, internal policies, lack of budget or lack of expertise; the budget issue is perfectly illustrated by the Netflix challenge algorithm that worked wonders but was ultimately rejected due to the engineering effort required to implement it.
Learnings

There are prescribed learning objectives outlined in the subject guide; however, after some deep reflection, I realised that most of the learning was in the realm of meta-learning, i.e. learning to learn, much of which will be slowly unfurled in the following sections.
Furthermore, we had to create two of our own learning goals, against which our performance would be assessed. The two I chose were finally defined as:
· Deliver actionable insights, production code, thorough analysis and documentation
· Identify and execute relevant algorithms and services for solving the given problem
The sections below show some of the salient aspects of my journey in this project.
Negotiating More Data
One of the first things I did after receiving the data set was to make a list of all the variables and rows that were not included but could have been useful for the analysis. These include student and tutor tracker variables, i.e. any arbitrary value that uniquely tracks the student/tutor so that it would be possible to group the sessions by students/tutors. These could be used to model the student’s or tutor’s idiosyncratic behaviour that leads to a particularly high/low rating. An XGBoost model would eventually show that these two variables were ranked in the top five in importance.
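The feature-importance check mentioned above can be sketched as follows. I use scikit-learn’s GradientBoostingClassifier here so the snippet is self-contained, but XGBoost’s scikit-learn wrapper exposes the same feature_importances_ attribute; the data is synthetic, standing in for the confidential session data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the session data; imagine two of the eight
# columns playing the role of the student/tutor tracker variables
X, y = make_classification(n_samples=300, n_features=8,
                           n_informative=3, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Rank features by importance, highest first
ranking = np.argsort(model.feature_importances_)[::-1]
print(ranking[:5])  # indices of the top-five features
```

Checking whether a newly negotiated variable lands near the top of this ranking is a quick sanity check that asking for it was worthwhile.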
Even at this very early stage, I had to keep the client’s directions and intended outcomes front and centre while asking for more variables. It would have been easy to request a whole lot of data in the hope of finding something interesting later on; it was another thing entirely to articulate precisely how a particular variable (perhaps combined with others) could be used to show a way to improve the service.
Upon reflection, this is something I should be more aware of, not just in the context of asking for more data, but when requesting that others in the workplace undertake tasks such as clearer documentation, more thorough testing, closer adherence to coding guidelines, etc. It is more effective to get them on board with why something is worth doing in the first place than to rely on authority to force others to do it.
Self Reflection

Working full-time and studying part-time is never an easy lifestyle. However, I felt that to be especially true with this subject. Due to the flexibility of the subject, most (if not all) of the learning objectives were under my control. This is very different from traditional subjects, in which there are set topics to master by exam time, a period of about 12 to 14 weeks; the challenge there is to allocate one’s time effectively throughout those weeks. In iLab 1, the set of topics itself is variable: I had a say in what I wanted to achieve by the end of the 14 weeks.
The amount of learning possible is proportional to the amount of time spent on the subject. The greatly reduced time available for study meant that the learning goals also had to be reduced commensurately and savagely prioritised.
Naturally, setting one’s own criteria also became essential. I needed to know when I had achieved a level of competence in a particular learning goal in order to make an informed decision to start working on another. This was often problematic, as I would repeatedly underestimate the time required, and one goal’s allocated time slowly crept into the next goal on the list. After a few weeks, I realised that I had not achieved as much as I had expected, at which point I had to rework the plan and discuss with the client what was possible in the remaining time.
It’s good preparation for taking on side projects in a freelancing capacity, although there are two major differences. One is the fundamental difference in type of goals. In the subject, the goals were all of the learning type, whereas, in the real world, the goals would be the delivery of a particular analysis or product, presumably with sufficient competence to complete such analysis or product. Having sufficient competence would allow for a more accurate estimate of the time required to complete the tasks, whereas, in my case I didn’t know what I didn’t know so it was much more difficult to estimate the time to learn something that could eventually take two weeks or two months to master.
From this, I have instituted two processes:
· weekly reviews
· contingency plans
Business Understanding and Context

Business understanding and context are the missing links often cited in stories of machine learning projects going off the rails. The usual story is that a group of machine learning engineers spends months successfully increasing accuracy from 99% to 99.5%, which is not a trivial task despite the seemingly small improvement. However, the business cannot implement the solution because getting to 99.5% requires data that is not usually available and computing power that is way over budget. The solution is ultimately rejected.
There were several small pieces of work that I did for the client that eventually ended up nowhere because the insight I proposed required the client to take actions that were contrary to the company policies. This meant one wasted week of going down what I thought was a fruitful rabbit hole. From that point onwards, I made sure to clearly state my intention for analysing a particular set of variables and ask (as a fellow student once said) “If I could show you that …, what would/could you do with it?”
Additionally, the extra communication often reveals the client’s larger strategy in the market which, complemented with some research on the competition, could surface information that creates synergies with the current project. This could impact the design of the solution and take the analysis in a completely different, more value-adding direction.
Upon reflection, I feel I missed the opportunity to capitalise on this extra information and hence deliver a more robust solution. The takeaway for me is to look beyond the given mandate and data set. Understanding the project allowed me to successfully obtain more data; understanding the client’s pain points and strategy would have allowed me to think more globally and provide suggestions outside the mandate.
Group Work vs Individual Work

Group work is the dreaded mode of all university assignments; fears of the laggard and the overbearing, self-appointed leader abound. When I first found out that no one else would be working on the same project, I was quite happy that I would get a complete say in the direction of the analysis. However, within a few days of receiving the data set and the mandate, I felt some symptoms of analysis paralysis. I was researching methods to process free-text review data and extract features; I was thinking about how to slice the data and which combinations of variables to plot. There were too many possible steps forward, and I didn’t know which one to take. Occasional gatherings with fellow students confirmed the numerous ways to proceed; however, none of them could give me an answer beyond the superficial, because none of them had the same depth of understanding of the project as I did.
I struggled to formulate a hypothesis and investigate that hypothesis because I was just second-guessing myself; my mind kept on jumping to other possibilities, and I couldn’t focus on just one path of analysis. I can honestly say that having another person working on this project would have led to a better outcome.
I am now aware of the difficulties, so next time I would seek at least one peer with whom to tackle such a project.
Desirable Difficulty

I don’t think anyone who completed the subject felt hunky-dory all the way to the end, and if they did, then I would argue that they didn’t reach high enough. The term desirable difficulty refers to the phenomenon that the learning process is generally not pleasant as it is happening but, as the name suggests, is nevertheless desirable because it leads to learning. The common pitfall is to be aware of only the difficulty, which happens quite naturally to humans, and to completely miss the desirable part. It is a common assumption that comfort is the goal state and any deviation from the comfortable state to the uncomfortable (or difficult) state is undesirable. However, this is a fallacy. Perhaps a common motto of the go-getters illustrates this more saliently: “Get out of your comfort zone.”
Some people have suggested that it would be better to wait until one has more skills before undertaking the iLab. There is merit to this, and while I agree that there should be a threshold of technical proficiency (if you don’t know how to fit a GLM, then maybe you should wait a semester or two, or maybe not, depending on your background and time available), the question largely misses the point. The whole point of learning is to set goals that are slightly out of reach at the beginning of the journey. Difficulty should be expected, and considered the default, whenever learning is the goal. Expecting difficulty mitigates the automatic response to run away from it. The trick is to be aware of this while it is happening and then to acknowledge the learning that has occurred as a result.
As Lemony Snicket once said, “If we wait until we’re ready, we’ll be waiting for the rest of our lives.”
Technical

The biggest technical gain was becoming more fluent in the machine learning “language”. I see functions and objects as the verbs and nouns of a language; proficient command of the language allows for more critical and engaging communication. In a way, the technical side was the easiest of the three (the others being self-management and client management) for me to acquire. Having the conceptual understanding already, implementation was a matter of:
· reading the scikit learn documentation, such as this one on RandomizedSearchCV
· reading individually written articles for examples of implementation, such as this one on stacking
· reading Wikipedia articles for a general introduction to the topics, such as this one on hyperparameter optimisation
Now I’m proficient in creating pipelines and in tuning hyperparameters for every step of the pipeline.
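A minimal sketch of that workflow, tuning a step inside a pipeline with RandomizedSearchCV; the estimator, parameter ranges and data are illustrative assumptions, not the project’s actual setup:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)

# A pipeline whose steps can all be tuned in a single search
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", RandomForestClassifier(random_state=0)),
])

# The step-name__parameter syntax reaches into any step of the pipeline
param_distributions = {
    "clf__n_estimators": randint(50, 200),
    "clf__max_depth": randint(2, 10),
}

search = RandomizedSearchCV(pipe, param_distributions,
                            n_iter=5, cv=3, random_state=0)
search.fit(X, y)
print(search.best_params_)
```

Sampling from distributions rather than exhaustively enumerating a grid is what makes the randomised search cheap enough to cover several steps at once.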
I tried my hand at using the VotingClassifier and VotingRegressor for ensembling. However, the score was no better than that of the individual classifiers. More learning is needed on the conditions under which these algorithms produce superior performance. Furthermore, while VotingRegressor’s aggregation function simply takes the mean of the estimates, more sophisticated techniques such as StackingRegressor are available.
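A toy comparison of the two approaches mentioned, on synthetic data (the base estimators and meta-model are my own choices, purely illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (RandomForestRegressor, StackingRegressor,
                              VotingRegressor)
from sklearn.linear_model import LinearRegression, Ridge

X, y = make_regression(n_samples=200, n_features=5, noise=10.0,
                       random_state=0)

estimators = [
    ("lr", LinearRegression()),
    ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
]

# VotingRegressor averages the base predictions
voter = VotingRegressor(estimators).fit(X, y)

# StackingRegressor instead learns how to combine them with a meta-model
stacker = StackingRegressor(estimators, final_estimator=Ridge()).fit(X, y)

print(voter.score(X, y), stacker.score(X, y))
```

The stacker’s learned weighting can beat the plain mean when the base models have very different error profiles, which may explain why a simple vote added nothing here.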
While there is still a lot that I don’t know, at least now I have a sense of what I don’t know and what I need to research given a set of project requirements. I am better able to estimate the time it would take me to complete a task, and hence the project.
Reflections on Learning
As I mentioned at the beginning, in addition to the technical knowledge gained, there was a lot more besides. One of the most important lessons was self-reflection, which I think is one of the most effective tools for optimal learning and growth. While it would still be possible to learn and grow without self-reflection, it would be difficult to evaluate whether the knowledge gained was what was expected, whether one should take extra steps to improve and whether one is going in the desired direction. Another crucial benefit of self-reflection is realising that difficulties are not necessarily a bad thing to be avoided (as mentioned above under desirable difficulty).
The old me would have chosen paths that would lead to the most pleasant experience, not knowing that this is probably not the optimal choice. Now I can rest assured that when I experience difficulties, I don’t aggravate the situation by trying to escape it all the time and instead accept it willingly knowing that I will become a better person because of it.
Next Steps

There will be iLab2, in which I’m looking forward to applying all the learnings gained to hit the ground running.
Furthermore, I’m also excited to try out these skills in the workplace, both from a technical machine learning aspect and from a stakeholder-communication point of view, as well as to maximise my learning through regular self-reflection and thorough planning.
