What I’ve Learned as a Data Scientist

Gang Su
Mar 21, 2018

This article is long overdue. For a while now I have been thinking, in the back of my mind, about what I have learned as a data scientist over the years. I’ve enjoyed most of my career so far, and every day feels like a fresh start. It’s a challenging, constantly evolving but super exciting job, and I just want to share some personal perspectives. No two data scientists are identical!

There are many genomic variants of data folks: data scientists, business analysts, research scientists, data engineers, analytics engineers, machine learning scientists, just to name a few. Some are perceived as more “premium” than others (SQL monkeys vs. deep learning gurus). I have seen many people ask, “What should I learn to become a data scientist?” A typical answer is a mix of some coding skills, some statistics, some SQL and some machine learning. To me, however, this is not a sufficient answer. Much like GRE/TOEFL scores for grad school, technical competencies are merely the bare minimum prerequisites, and with today’s open internet someone with zero prior exposure can build a predictive model from scratch in a couple of hours (I have sat in on some of the one-day DS bootcamps). Knowing programming, statistics/ML and data wrangling merely gets you to the door. You’ve got to think hard: if you were to hire someone as a data scientist, what type of talent would you value, and how would that person contribute to the business?

So let’s set the technical competencies aside and talk about what differentiates a star data scientist from an OK data scientist (and by no means am I there yet). First and foremost, the most important attribute, to me, is judgment. The fundamental mission of any data science role is not to build complex models, but to add value to the business. Only a few places, such as Google, Facebook or Microsoft, can sponsor high-caliber researchers to dig into extremely complex artificial intelligence (a hype word) problems independent of business needs. Even for these tech giants, those big, bold strategic bets are based on the assumption of massive long-term returns (such as an upper hand in the self-driving business). For everybody else, the primary function of data science is to support the business: generate growth, improve engagement, reduce inefficiency and, ultimately, bring in more revenue at less cost. Any methodology that gets the business there sooner will do, regardless of whether it comes from a fortune teller, an experienced consultant, or an AI. In the context of a data scientist role, this means:

  • You have to understand where the business is going (the company’s North Star).
  • You have to understand how various business units such as marketing, product and engineering are currently working towards the company’s North Star, and what their road maps are. Understand the alignments and gaps, and how data can drive synergy.
  • You have to understand how the data science function fits into the business organically, and how the team is developing and contributing. Where are the biggest deficiencies, both technical and operational (data quality? pipeline efficiency? communication clarity? value?).
  • Then, finally, you have to understand how to get data, what you can learn from the data, what data products you can build cross-functionally, and how they may change the business.

As you can see, every level mentioned above involves substantial thinking, judgment and communication, and it’s not as easy as running EDAs or building models. The biggest risk for many data scientists is working in a vacuum: trying too hard to solve technical problems without thinking about the business impact. The Why question is much more critical than the How, and very often an 80% solution is superior to a 90% solution that may take twice the resources and time to build and deploy. Finding the sweet spot is more of an art than a science.

And on top of the technical decision making, you should be driving discussions, aligning teams, and providing solid evidence to support your perspectives about the past and the future. There are tons of presentations, group meetings, 1:1s and informal chats, and it’s very easy to get lost. You need to have an opinion and an agenda, and you must be capable of advocating for them. You will be telling a convincing story, in various formats, at various levels of technical depth, to various audiences on various occasions.

Once you think you can do all of that, you also need a very solid technical foundation to deliver trustworthy findings. The credibility of a data scientist is everything: one major mistake can compromise trust and, consequently, get you banished from the core decision tables. You need to build your brand, and rally people around you as the source of truth. You will need to know almost everything in sufficient depth, and become a true full-stack expert-generalist. You need to be able to work end to end, and to bootstrap and unblock yourself whenever necessary. As the business grows, new problems arise and new skills are needed. You have to stay competitive in your trade while learning new tricks to stay in the game. You are the ultimate Swiss Army knife in the horde.

It ain’t easy, and you only have so much time in a day. The following are some fundamentals to get you started:

  • Exploration: you need to be able to extract, clean and transform data, big and small, and have the intuition to generate hypotheses, find answers and recommend solutions. You are capable of presenting the findings to a broad range of audiences, in tabular and graphical forms, formally and socially. You light a path through the mist of data for your fellow folks to follow.
  • Forecast / Prediction: you can build predictive models to outline the future from the past confidently and efficiently. You are versatile and can handle any regression or classification problem, and you know how various algorithms work and what their pros and cons are. You know where the trade-offs are, and how to address the business problem with minimal investment and maximum gain. You also need to explain various models to folks in business, product and operations, to clear up their doubts and fears about black-boxish solutions.
  • Experimentation: you know how to design experiments to prove the value of your predictions. You know the common frameworks and pitfalls of experiments, how to instrument them, how to choose appropriate statistical tests, and how to measure risk and futility. This skill is commonly underestimated as it doesn’t sound as sexy as the modeling piece. However, the real value of a model that only runs offline can never be proven, especially when measured in dollar signs. Experimentation is the only way to definitively prove causality and the value of your work.
  • Measure and Socialize Success: and last but not least, how do you design and choose metrics, from potentially hundreds of candidates, as the indicators of success? What’s the decision hierarchy, and how do the metrics reflect the evolution of the business in the short term and the long term? How do different teams measure similar areas of interest? Again, this is a blend of art and science, and it’s not as straightforward as accuracy or AUC.
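To make the Experimentation point a bit more concrete, here is a minimal sketch of one common statistical test for an A/B experiment: a two-sided two-proportion z-test on conversion rates. The function name, the arguments and the numbers in the usage note are all hypothetical and purely for illustration. The sketch also assumes samples large enough for the normal approximation to hold. A real experiment would additionally need a power analysis and a stopping rule decided up front.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates.

    conv_a / n_a: conversions and users in control (hypothetical names).
    conv_b / n_b: conversions and users in treatment.
    Returns (z statistic, two-sided p-value).
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both groups convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


if __name__ == "__main__":
    # Made-up numbers: 2.0% vs 2.5% conversion on 10,000 users each.
    z, p = two_proportion_z_test(200, 10_000, 250, 10_000)
    print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value only says the difference is unlikely to be noise. Whether a lift of that size is worth shipping is still a business judgment call.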

And you will exercise good judgment in many of these processes. Very often the model performance metric is not the deciding factor in whether you should proceed to production. The potential incremental gain, cost of development time, required resources, implementation complexity, computation time, interpretability, data availability, fail-safe plans, impact on other teams… and many other factors are much more important. You always go for a locally optimal solution that lets you try, and fail or succeed, fast. Time is everything.

And on top of all this, you need to be passionate about the business. You don’t come in and just explore data or build models for the sake of doing so; you want to make the business better with your outstanding work. Back to the first question: what should you learn? Before that, you want to ask yourself where you want to be. Different companies have very different needs, and since you can’t do everything from the beginning, you need to focus. Whether it’s time series prediction, natural language processing, recommendation systems or image classification, you need to figure out what the business is looking for, and align that with your personal goals. Instead of spending all your time on LeetCode problems, you should also spend time reading the company’s tech blogs, news coverage, financial reports and leadership’s public talks. You just have to do more than everybody else to be the one. If you are still asking which programming language you should learn, R or Python, then you are not thinking hard enough.

Finally, an ideal career to me is one where 1) I enjoy what I do, 2) I keep learning every day and 3) I earn good compensation to support my family. Having two of the three is great, and I am grateful that I have the privilege to enjoy all three of them. It’s the best time for data science: massive data, better tooling than ever and ever-changing business challenges. There are a lot of great opportunities, but you need to put in A LOT of work. Trust me, it’s totally worth it.
