Why domain knowledge is important in Data Science

Mar 18, 2019

What is data science?

Before we answer why, we have to understand what data science actually is. According to Wikipedia, “Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from data in various forms, both structured and unstructured, similar to data mining.”

Simply put, data science is the field in which raw data is processed into information.

What does domain knowledge mean?

The term “domain knowledge” was in use well before data science became popular. In software engineering, it means knowledge about the environment in which the target (i.e., the software agent) operates.

We can adapt the same definition to data science: “Domain knowledge is the knowledge about the environment in which the data is processed to reveal secrets of the data”. In other words, domain knowledge is knowledge of the field that the data belongs to.

How does domain knowledge influence data science?

You may have studied data science and machine learning and used algorithms such as regression and classification to make predictions on test data. But the full power of an algorithm and its data can be harnessed only when we have some domain knowledge. Used well, that knowledge often improves the accuracy of the model too.

For example, knowledge of the automobile industry helps when working with automotive data. Say we have two features, horsepower and RPM. From them we can create an additional feature, torque, using the formula:

TORQUE (lb-ft) = HP x 5252 ÷ RPM

This derived feature could change what a machine learning model learns during training and may result in higher accuracy.
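As a minimal sketch of how such a feature could be derived, the snippet below applies the torque formula to a couple of rows. The function name `torque_lb_ft`, the dictionary keys, and the sample values are all made up for illustration and do not come from any real dataset. The constant 5252 assumes horsepower in imperial units and torque in lb-ft.

```python
def torque_lb_ft(horsepower: float, rpm: float) -> float:
    """Return torque in lb-ft from horsepower and engine speed in RPM."""
    if rpm <= 0:
        raise ValueError("rpm must be positive")
    return horsepower * 5252 / rpm


# Hypothetical rows; the numbers are illustrative only.
cars = [
    {"horsepower": 150.0, "rpm": 5252.0},
    {"horsepower": 300.0, "rpm": 6000.0},
]

# Add the derived feature to every row.
for car in cars:
    car["torque"] = torque_lb_ft(car["horsepower"], car["rpm"])

print(cars)
```

With a data frame library the same idea would usually be a single column expression; the plain-Python version is used here only to keep the example dependency-free.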

Where is domain knowledge useful?

You may have gathered from the above example that domain knowledge is most useful in feature engineering. Feature engineering is the creation of features using domain knowledge so that machine learning algorithms perform better.

Let’s look at an example with economic data to support what we have seen so far. The application of statistical and mathematical methods to economic data is called econometrics, and machine learning, particularly regression, is now widely used to draw insights from raw data. We will look at two models: one without feature engineering and one with feature engineering based on domain knowledge of economics. To keep this blog simple and concise, we will only fit the models and compare them.

For example, let’s take the Catalonia GDP data, which you can download here. Let’s see the head of the dataset.

We will then apply linear regression before and after applying domain knowledge.
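Each model in this post is summarised by two numbers: a score and an RMSE. The sketch below shows one way such numbers can be computed. It is not the original code: it uses NumPy with synthetic data rather than the Catalonia dataset, the helper names (`fit_linear_regression`, `predict`, `r2_score`, `rmse`) are our own, and we assume the score is R², which is what linear regression implementations commonly report. For simplicity it evaluates on the same data it was fitted on.

```python
import numpy as np


def fit_linear_regression(X, y):
    """Ordinary least squares with an intercept; returns [intercept, slopes...]."""
    X1 = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef


def predict(coef, X):
    """Apply fitted coefficients to a feature matrix."""
    X1 = np.column_stack([np.ones(len(X)), X])
    return X1 @ coef


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)


def rmse(y_true, y_pred):
    """Root mean squared error, in the units of the target."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Synthetic data for illustration only (NOT the Catalonia GDP data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 + X @ np.array([1.5, -0.7, 3.0]) + rng.normal(scale=0.1, size=200)

coef = fit_linear_regression(X, y)
y_pred = predict(coef, X)
r2 = r2_score(y, y_pred)
error = rmse(y, y_pred)
print(r2, error)
```

A higher score and a lower RMSE both indicate a closer fit, which is the comparison made between the two models that follow.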

Without application of domain knowledge

Code without feature engineering

Output

0.9905101700533853
3325.856952652541

With the application of domain knowledge

We apply domain knowledge to create two features. Trade openness combines two existing features, total exports and total imports. Domestic demand per GDP without construction is obtained by subtracting the construction sector from domestic demand and dividing the result by GDP, as shown below.

Feature Engineering
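A minimal sketch of these two features is given below. The column names (`gdp`, `exports`, `imports`, `domestic_demand`, `construction`) and the sample values are hypothetical, not taken from the Catalonia dataset. The text above only says that exports and imports are combined; dividing their sum by GDP is one common definition of trade openness and is an assumption here.

```python
def add_domain_features(row: dict) -> dict:
    """Return a copy of the row with two engineered features added."""
    out = dict(row)

    # ASSUMPTION: trade openness = (exports + imports) / GDP.
    out["trade_openness"] = (row["exports"] + row["imports"]) / row["gdp"]

    # Domestic demand excluding construction, as a share of GDP.
    out["domestic_demand_ex_construction_per_gdp"] = (
        row["domestic_demand"] - row["construction"]
    ) / row["gdp"]

    return out


# Hypothetical row with made-up values, for illustration only.
row = {
    "gdp": 200.0,
    "exports": 60.0,
    "imports": 40.0,
    "domestic_demand": 150.0,
    "construction": 30.0,
}

features = add_domain_features(row)
print(features)
```

Both features are ratios to GDP, so they stay comparable across years even as the size of the economy changes.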

Finally, we fit the model and check its scores.

With feature engineering

Output

0.9961357892269006
2122.2901303188205

Conclusion

In conclusion, we see that feature engineering based on domain knowledge gives a better accuracy score and a lower RMSE than the model without it: the score rises from about 0.9905 to 0.9961, and the RMSE falls from about 3326 to 2122. You can get the entire code here.

References

Catalonia GDP Kaggle kernel

Contact

You can give me your feedback in the comments below or email me at saianand0427@gmail.com. I am on Kaggle and GitHub as well!