The Impact of Domain Knowledge on Data Science

Rodrigo Veríssimo
Jan 3, 2020


Glorious Morning Light and Reflections by Rodrigo Veríssimo

Data Science is getting increasingly popular, with more and more new data scientists venturing into real-world problems. These newcomers are well trained in Computer Science, Math and Statistics, because that’s what universities and data science academies teach. Is that everything you need to do Data Science? No! It is, however, a good starting point to grow from.

One very important ingredient is naturally missing from what is taught. Check the graph below.

Source: https://​towards­da­ta­​why-data-science-succeeds-or-fails-c24ed­d2d2f9

I know, I know. I bet you have already seen a variation of this chart, but do you see what is missing? Domain knowledge! It can be defined as knowledge of the field that the data belongs to.

You can throw a state-of-the-art Machine Learning algorithm at a dataset with some basic data cleaning and feature engineering, and you still won’t get the best estimates possible. In other words, you’re not getting the most out of what you already have: the data. If you understand your dataset and how the company you work for operates, you will get much better estimates. You’ll feel much more confident about your work as well. How can that be? Someone with the required domain knowledge naturally does the following:

  • Stays on the lookout for more tables in the database and external datasets relevant to the problem that can be joined to the current dataset;
  • Knows which samples are genuine outliers and removes them with reasonable certainty that they are errors rather than legitimate cases. Imagine that you work for a shipping company and want to estimate delivery times. If you blindly apply an outlier removal strategy, such as a z-score threshold, you may remove shipping routes that are long but perfectly normal and not that uncommon;
  • Splits the data into train, dev and test sets with strategies that simulate the conditions the data science solution will face. This avoids unpleasant surprises once the solution goes into production;
  • Engineers relevant domain-specific features, which can range from Physics formulas to time-series statistics;
  • Understands and avoids unnecessary technical debt in the data and features used in the data science pipeline. This is done by removing irrelevant and unnecessarily costly features, as well as unmaintained data sources.
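
To make the outlier and splitting points concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the route names, the delivery times, the threshold of 3 and the helper names (`zscore_outliers`, `zscore_outliers_by_route`, `temporal_split`) are all hypothetical. Grouping by route and splitting chronologically are just two possible ways to encode domain knowledge, not the only ones.

```python
from collections import defaultdict
from statistics import mean, pstdev


def zscore_outliers(values, threshold=3.0):
    """Return the positions of values whose absolute z-score exceeds threshold."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing can be an outlier
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


def zscore_outliers_by_route(records, threshold=3.0):
    """Same rule, but each delivery is only compared with its own route."""
    by_route = defaultdict(list)
    for index, (route, days) in enumerate(records):
        by_route[route].append((index, days))
    flagged = []
    for items in by_route.values():
        positions = zscore_outliers([days for _, days in items], threshold)
        flagged.extend(items[p][0] for p in positions)
    return sorted(flagged)


def temporal_split(shipments, n_dev, n_test):
    """Sort by date and cut chronologically, so dev and test are later than train."""
    ordered = sorted(shipments, key=lambda s: s[0])
    n_train = len(ordered) - n_dev - n_test
    return (
        ordered[:n_train],
        ordered[n_train:n_train + n_dev],
        ordered[n_train + n_dev:],
    )


# Hypothetical deliveries: (route, delivery time in days).
records = [("short", 2), ("short", 3)] * 20                 # 40 ordinary short-route deliveries
records.append(("short", 12))                               # a short route that took suspiciously long
records += [("long", 29), ("long", 30), ("long", 31)]       # long routes, slow but normal

global_flags = zscore_outliers([days for _, days in records])
route_flags = zscore_outliers_by_route(records)
print(global_flags)  # [41, 42, 43] -> the normal long routes get removed
print(route_flags)   # [40]         -> only the suspicious short-route delivery

# Hypothetical shipments: (ISO date, delivery time in days), deliberately unsorted.
shipments = [(f"2019-{month:02d}-01", 2 + month % 3) for month in range(10, 0, -1)]
train, dev, test = temporal_split(shipments, n_dev=2, n_test=2)
print(len(train), len(dev), len(test))  # 6 2 2
```

The global z-score throws away exactly the long routes that the business considers normal and misses the delivery that actually deserves a second look. Comparing each delivery with its own route fixes both problems. The chronological split assumes the model will be used to predict future deliveries, so it is evaluated only on data that comes after what it was trained on.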

If you are new to Data Science, don’t worry, as this will come with time and experience. However, you can speed things up if you know what to focus on. You can start doing these things if you haven’t already:

  • Talk daily or weekly with your teammates about their projects and try to understand them, as well as their current challenges and solutions. Once you get accustomed to them, you will naturally start to offer interesting suggestions as your domain and technical knowledge expand;
  • Talk to analysts, data scientists and other colleagues outside your team to get to know their projects, because there may be overlapping projects or datasets where you can help each other;
  • Read about your domain. If you’re in e-commerce, fraud detection, time-series forecasting for wind turbines, music streaming, … read about it! Read not only technical or data science material, but also about what is happening in the field itself.

As Warren Buffett advises:

By far the best investment you can make is in yourself.

If you’re new to data science and don’t have any domain knowledge, don’t worry. Be humble, ask questions and be willing to learn and help with what you can. With time, dedication and passion you will get there and hopefully be a great data scientist.



Rodrigo Veríssimo

Data Scientist @Farfetch | Portrait and Street Photographer @developedgrain