A Retrospective Look at What It Means to Be a Data Scientist
I wrote about this subject in two blog posts during my earlier years of working as a data scientist.
So now, 3 years down the road, I guess I am a little more knowledgeable on the matter, and a little bit wiser.
Back to that definition I was talking about: recently there have been two articles which I think provide a good description of the skills needed to become a data scientist, and of the roles a data scientist plays in a day-to-day setting. In the second half of this post, I’ll include my 2 cents on the articles and how they relate to my daily work.
The Skills 
Picking it up from Forbes (which in turn picked it up from Quora), the top 5 skills are:
1. Programming
I guess this is pretty much a no-brainer. Programming skills come in handy, especially when you’re trying to (1) massage data, and (2) automate repetitive tasks.
2. Quantitative Analysis
This is where the “science” part of data science comes from. A data scientist is basically expected to be able to design experiments on a selected audience and later analyze the results. Using those results as a base, one can then build complex statistical models to predict customer behaviour and guide business initiatives.
3. Product Intuition
If you’ve ever seen the Venn diagram by Drew Conway on what defines a data scientist, you’ll notice that he has also included the “Subject Matter Expert” as one of the 3 main areas. And it’s not without its merit: having expertise in the subject matter/product goes a long way in helping a data scientist understand customer behaviour, explain the data at hand, and see whether their initial assumption of how a customer should behave conforms to what the data is saying.
4. Communication
With all the findings from 1–3, being able to convey them to the stakeholders becomes much more crucial.
There are a lot more ways that one can describe the needed skills, and if you want to be anal about it, you can. But the 5 above are (to me) a good distillation of the different facets of skills needed to become a data scientist. You could probably go read the original Quora posting for more details.
The Roles 
A data scientist from Coursera defines these 4 roles as the possible career paths for a data scientist (as in what you’d actually be doing in your company; being called a data ninja/hacker/miracle worker is just simply weird IMO).
1. Decision Scientist
To develop and test hypotheses that are key to the product and business direction. What this means is that they’re really going to be making data-driven decisions for the product and for business strategy. And a lot of the methods they’re going to use are things around experimental design, like A/B testing, and statistical modelling. They need to be able to explain how things are going to regress over time and understand the causality of behaviour.
2. Machine Learning Engineer
Build data products powered by machine learning models. Examples range from the recommendation systems deployed by Amazon or Netflix to Tesla’s self-driving cars. There’s much emphasis on machine learning techniques and software engineering, and great importance placed on deploying at scale.
They (the folks at Coursera) have also defined Data Infrastructure Engineer and Business Intelligence Engineer as the sort of roles that a data scientist might go into. In my opinion though, I’d probably be more comfortable calling them Data Engineers. However, titles are merely titles. From experience, I’ve received a lot of calls from recruiters asking me whether I’m interested in becoming an ETL Developer, Big Data Engineer, Hadoop Developer and the like while I was working as a data scientist. So the folks at Coursera might have a point.
I’d encourage you to check out their webinar and watch the whole video (especially if you’re an aspiring data scientist).
My Day To Day
Part of what triggered me to write this post was this podcast (Partially Derivative). In the podcast, the guy said that when he wanted to make the transition from an academic setting to the industry, he didn’t have a clue what to expect or how different the setting and expectations would be.
It was kinda amusing to hear that, to be honest, but totally understandable. People don’t normally talk about how mundane their work day is in their blog or social postings. The fact that the world basically lacks data scientists at the moment probably compounds the problem. There IS no “a day in the life of” for a data scientist.
So I suppose I could give it a try eh?
- The first thing I’d do as I arrive (once I’ve put down my bag at my desk) is head to the pantry and get myself a good cup of coffee.
- I join my team’s daily standup meeting, discussing what I did yesterday with my team and product manager (who is remotely checking in via video conference). The whole thing takes about 15 minutes.
- I open my laptop and check my emails on Outlook. Slack is open as well since it’s the de facto communication medium at the office. We automate a lot of the reminders (meetings, alerts etc.) there since it’s convenient.
- Time to work. I open up a few browsers on my monitors. One for code development, one for research and another for personal stuff (like my Spotify playlist).
- For development, I usually use Jupyter and code in Python. Nowadays though I’m using Databricks, and I have only good words for it. I’m trying to make myself love Sublime (since it’s hyped up so much), but keep finding myself going back to Notepad++ (old habits die hard).
- Downloads data from AWS S3. Does a quick SQL query and plots a graph. Finds it weird that people are working for more than 100 years. Thinks that Asians are too hardworking and are probably overworked.
- Scraps initial hypotheses. Believes that the data point is an outlier. But what’s the best way to remove outliers?
- Does some research on the best way to remove outliers for this given dataset. 2 standard deviations away from the mean? Probably not, as the outliers themselves would skew the mean value. Continues looking through academic research papers for more brainy thoughts and ideas.
- Realizes that I’ve spent too much time doing research without having done any serious coding yet. Brief panic attack.
- Manages to compose self back to normalcy and decides that it’s best to adopt simplicity for now. “Over-analysis leads to paralysis”, I think to myself.
- Head home, while thinking to self that there must be a better way to solve that outlier problem. Thinks that I need more time.
The above is a somewhat accurate picture of what goes on in my daily work. Other things to note:
- I don’t interact much with my colleagues while I’m in my “zone”. I find it really hard to shift my focus from coding and thinking/theorizing to chatting and back to coding.
- I didn’t mention it above, but I do have my lunches.
- Self-doubt is my common theme. And I’m always looking for perfection (when there isn’t any). In a way it kinda drives me to seek more than what I could settle for, which is probably a good thing for me (but not for my employer, since it might have a negative impact on the project timeline. The struggle is real.)
- For the day above, I’ve used programming and quantitative knowledge to extract, massage and clean the data.
- I’m not the subject matter expert in my current industry (job marketplace; I came from a telco background), so that’s where constant communication with the stakeholders is crucial. They’re just a Slack message away. Example questions are something like: (1) Is this how the data is normally distributed? (2) What does this category encoding mean?
- In terms of the roles I’ve mentioned earlier in this post, I’d say the bulk of what I’m doing now is closer to Data Engineering than to Machine Learning or Decision Science.
- As a data scientist, especially since it’s a relatively new field in this country, one can’t expect to focus solely on a particular domain of interest (I somehow think that most people only think of ML when they decide to jump on this bandwagon). As the market matures however, I’d expect the industry’s expectations to evolve as well, and we’ll be seeing a lot of area-specific job titles being used instead of simply “Data Scientist”.
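On that outlier question from my day above: one simple option that sidesteps the skewed-mean problem is to measure distance from the median instead, scaled by the median absolute deviation (MAD). The sketch below is only an illustration of that idea, not what I ended up shipping. The function name, the `years_worked` sample values and the 3.5 cutoff (a commonly quoted rule of thumb for this "modified z-score") are all assumptions made up for the example.

```python
from statistics import median


def mad_outlier_mask(values, threshold=3.5):
    """Return a list of booleans, True where a value looks like an outlier.

    Uses the modified z-score 0.6745 * |x - median| / MAD, which relies on
    the median rather than the mean, so a few extreme values cannot drag
    the centre of the data towards themselves.
    """
    med = median(values)
    abs_dev = [abs(v - med) for v in values]
    mad = median(abs_dev)
    if mad == 0:
        # At least half the values equal the median, so the score is
        # undefined; flag nothing rather than divide by zero.
        return [False] * len(values)
    return [0.6745 * d / mad > threshold for d in abs_dev]


# Hypothetical "years of working experience" column with one impossible entry.
years_worked = [2, 5, 7, 3, 10, 4, 6, 120, 8, 1]

mask = mad_outlier_mask(years_worked)
cleaned = [v for v, is_outlier in zip(years_worked, mask) if not is_outlier]

print(cleaned)
```

With this made-up sample, only the 120-year entry gets dropped. Whether 3.5 is the right cutoff depends on the dataset, so it’s worth eyeballing what gets flagged before trusting it.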