First Text Analytics Project

Introduction
In this blog post, I describe my first ever data science project. For the most part, I cover the end-to-end journey and what I learned along the way.
The project was part of a UTS MDSI (Master of Data Science and Innovation) subject called iLab 1, in which students are presented with a number of real-life data problems experienced by actual organisations and companies (called UTS’s industry partners). The students work as data science consultants, servicing the industry partners as clients.
I chose a client with a text analysis problem. I have always been interested in text mining, and I regarded this project as an opportunity to learn more about the topic.
Problem Statement
The client is an IT division in a university. It provides IT services for students and staff who have break-fix and service requests. Students and staff submit these requests as tickets through a website. The quality of the data on root cause and resolution needs improvement, because a lot of the valuable information is recorded in comments rather than in specific fields.
The goals of the project are to:
· assign topics to comments from students and staff and short descriptions of real causes
· improve the quality of data with respect to root cause and resolution
Skills Before Starting
I started my degree at the beginning of 2019. By the time the project began, I had completed three subjects, all of them done in R, which is the official language of the course. I also had basic Excel skills.
Technical Skills
Before starting the project, I understood the core concepts of the data science methodology, including splitting data into stratified training and testing sets, the bias-variance trade-off, and many more. I gained these in the first three subjects of the course using R. However, I was not yet fluent in applying R to real data, and I had no experience as a consultant or on a data science project. I had only solved problems that were narrowly defined within assignments. In those assignments, there was a known, implementable solution and students were expected to find it using the techniques taught in the course.
I could also use Excel to split files or run basic analysis across numerous Excel files.
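To make the idea of a stratified training and testing split concrete, here is a minimal sketch in Python (not the R used in the course, and not code from the project). The function name, the label values and the 25% test fraction are assumptions chosen only for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_indices, test_indices) with similar class proportions in both."""
    rng = random.Random(seed)
    # Group row indices by their class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction from every class (at least one row per class).
        n_test = max(1, round(len(indices) * test_fraction))
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Hypothetical ticket categories: 8 "password" tickets and 4 "wifi" tickets.
labels = ["password"] * 8 + ["wifi"] * 4
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Because the sampling is done per class, the rarer "wifi" class still appears in the test set, which a plain random split does not guarantee.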
Soft Skills

I had never presented to a client or communicated with a client in English before (when I studied and worked in my home country, Viet Nam, I used only Vietnamese). Communication with the client is an often overlooked skill. Understanding the business context helps prevent time-wasting on unfruitful investigations. Such investigations are useless if the client cannot act on the findings for reasons such as legislation, internal policies, lack of budget or lack of expertise.
Learnings
I had to set two learning goals of my own, against which my performance would be assessed. The two I chose were finally defined as:
· Use KNIME to explore and visualize data, and use SQL to query data.
· Improve presentation and communication skills over the course of the project.
The sections below show some of the salient aspects of my journey in this project.
Self Reflection
Studying full-time and working part-time is never an easy lifestyle, and I found that to be especially true with this subject. Due to the flexibility of the subject, most of the learning objectives were under my control. The subject runs for about 12 to 14 weeks, and the challenge was how to organise my time effectively throughout that period. In iLab 1, the set of topics itself is variable. I now know what I want to be able to achieve by the end of the 14 weeks.
Because I missed the client’s first email and had to wait for the NDA from the client, I started the project 2–3 weeks later than the others. In addition, after a few weeks, I realised that I had not achieved as much as I had expected, at which point I had to rework the plan and discuss with the client what was possible in the remaining time.
From this, I have put the following process in place:
· weekly reviews
Business Understanding and Context
There was a piece of work that I did for the client that eventually went nowhere, because the insight I found was not useful or meaningful to the client at all. This meant I wasted some weeks. From that point onwards, I made sure to clearly state my intention for analysing a particular set of variables and ask “If I could show you that …, what would/could you do with it?”
To be honest, group work is the most dreaded mode of all university assignments. When I first found out that no one else was going to be working on the same project, I was quite happy that I would get a complete say in the direction of the analysis. However, within a few days of receiving the data set and the mandate, I felt some symptoms of paralysis by analysis. I was researching methods to process free-text data and extract features, and I was thinking about how to slice the data. There were too many possible steps forward, and I didn’t know which one to take. Occasional gatherings with fellow students confirmed that there were numerous ways to proceed. However, none of them could give me an answer beyond the superficial, because none of them understood the project as well as I did.
I had some solutions for handling the problems with text data in the project. However, my mind kept jumping to other possibilities, and I couldn’t focus on just one path of analysis. I can honestly say that having another person working on this project would have led to a better outcome. Now that I am aware of the difficulties, next time I will seek at least one peer with whom to take on such a project.
Desirable Difficulty
Some people have suggested that it would be better to wait until one has more skills before undertaking iLab 1. I agree that there should be a threshold for technical proficiency: if you don’t have basic knowledge of how to deal with data, then maybe you should wait a semester or two (or maybe not, depending on your background and available time). Still, the whole point of learning is to set goals that are slightly out of reach at the beginning of your journey. Difficulty should be expected and treated as the default whenever the goal is to learn. Expecting difficulty weakens the automatic response to run away from it.
Technical
I became more fluent in the machine learning “language”. I see functions and objects as the verbs and nouns of a language, and a proficient command of that language allows for more critical and engaging communication. In a way, the technical side was the easiest for me to acquire (the others being self-management and client management). Since I already had the conceptual understanding, implementation was a matter of:
· learning new languages or new algorithms, such as learning SQL on https://www.w3schools.com/sql/default.asp
· reading individually written articles for examples of implementation, such as this one on stacking
Now I can use KNIME to explore and visualize data and use SQL to query data.
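As an illustration of the kind of SQL query this involves, here is a minimal sketch using Python's built-in sqlite3 module. The `tickets` table, its columns and its rows are hypothetical and are not the client's schema or data. The query counts tickets whose root cause field is empty even though the comment contains text, which is the data quality gap described in the problem statement.

```python
import sqlite3

# In-memory database with a hypothetical ticket table (not the client's schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE tickets (
           id INTEGER PRIMARY KEY,
           requester TEXT,
           root_cause TEXT,
           comment TEXT)"""
)
conn.executemany(
    "INSERT INTO tickets (requester, root_cause, comment) VALUES (?, ?, ?)",
    [
        ("student", "expired password", "Reset the password and confirmed login."),
        ("staff", "", "Wifi dropped in the library, router restarted."),
        ("student", None, "Could not print, driver reinstalled."),
        ("staff", "faulty cable", ""),
    ],
)

# Tickets with no recorded root cause but a non-empty comment, per requester type.
missing = conn.execute(
    """SELECT requester, COUNT(*) AS n
       FROM tickets
       WHERE (root_cause IS NULL OR TRIM(root_cause) = '')
         AND TRIM(comment) <> ''
       GROUP BY requester
       ORDER BY requester"""
).fetchall()
print(missing)
```

Rows returned by a query like this are the ones where the comment text is the only source for a root cause.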
I tried my hand at using an algorithm called Latent Dirichlet Allocation (LDA) on student and staff comments to discover hidden topics by looking for important keywords, both within documents and across the entire corpus. However, the results were not meaningful. I need to learn more about automatically discovering topics in a collection of documents.
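For readers who have not seen LDA before, the following is a small sketch of how it can be fitted with collapsed Gibbs sampling, written in plain Python. It is not the implementation or the settings used in the project. The tokenised comments, the number of topics and the `alpha` and `beta` priors are all assumptions made for illustration.

```python
import random


def lda_gibbs(docs, n_topics=2, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Fit LDA by collapsed Gibbs sampling; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]             # tokens of doc d in topic k
    topic_word = [[0] * n_words for _ in range(n_topics)]  # times word w is in topic k
    topic_total = [0] * n_topics                           # tokens assigned to topic k
    assignments = []
    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][wid] -= 1
                topic_total[k] -= 1
                # Unnormalised conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wid] += 1
                topic_total[k] += 1
    return vocab, doc_topic, topic_word


def top_words(vocab, topic_word, n=3):
    """Return the n most frequent words of each topic."""
    return [
        [vocab[i] for i in sorted(range(len(vocab)), key=lambda i: -row[i])[:n]]
        for row in topic_word
    ]


# Made-up, already tokenised comments (not real ticket data).
docs = [
    ["password", "reset", "login", "password"],
    ["wifi", "router", "network", "wifi"],
    ["login", "password", "account"],
    ["network", "wifi", "slow"],
]
vocab, doc_topic, topic_word = lda_gibbs(docs)
print(top_words(vocab, topic_word))
```

On a real corpus, the quality of the topics depends heavily on preprocessing (stop-word removal, handling of very short comments) and on the chosen number of topics, which may be one reason results come out hard to interpret.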
Reflections on Learning
As I mentioned at the beginning, there was a lot more to learn besides the technical knowledge. One of the most important lessons was the value of self-reflection, which I think is one of the most effective ways to learn and grow. While it is still possible to learn and grow without self-reflection, it is difficult to evaluate whether the knowledge gained was what was expected, whether one should take extra steps to improve, and whether one is going in the desired direction. Another crucial benefit of self-reflection is realising that difficulties are not necessarily a bad thing to be avoided (as mentioned above under Desirable Difficulty).
The old me would have chosen the paths that led to the most pleasant experience, not knowing that this is probably not the optimal choice. Now, when I experience difficulties, I don’t aggravate the situation by always trying to escape. Instead, I accept them willingly, knowing that I will become a better person because of them.
Next Steps
Next comes iLab 2, where I look forward to applying everything I have learned so that I can hit the ground running.
Furthermore, I’m excited to show these skills to employers when I look for a job as soon as I finish iLab 1. These skills cover the technical machine learning side, stakeholder communication, and maximising my learning through regular self-reflection and thorough planning.