By Stefano Cariddi
Let us be frank: everyone with access to the internet has encountered, at least once in their life, the term data science, and this has probably happened in the past five years.
In this article we aim to introduce some key concepts that are necessary to understand what data science is, how it is related to machine learning, and why it is different from another trending topic, business intelligence.
Probably, the most successful explanation of what data science is dates back to 2010 and is represented by the following figure:
This is Drew Conway’s famous Data Science Venn Diagram, and it is the starting point for our discussion.
In Conway’s Venn diagram, Data Science lies at the intersection of three fields of knowledge. Hacking skills means computer science: a data scientist is a computer scientist who works with data. Math & Statistical Knowledge is the basis for understanding the tools of the trade. Substantive Expertise, or domain knowledge, is necessary for planning the path that leads from raw data to a result.
These three macro-areas of competencies have three intersection areas: Danger Zone, Traditional Research, and Machine Learning. Let us analyze them one by one.
One is in the danger zone if they think that being decent at coding while having good domain knowledge is enough to achieve breakthrough results. It is not. Just as a carpenter who does not know the difference between a nail and a screw would probably do a poor job of building a cupboard, a would-be data scientist who does not understand the math behind their tools risks doing more harm than good.
Traditional research, on the other hand, is where a high level of domain knowledge meets math and statistics. In traditional research, researchers explain observations with theories, so either the data fit their theoretical framework, or the researchers have to revise the framework to account for the discrepancies. This means that the data are connected to, and explained by, a known relation.
Machine learning, finally, is where computer science meets mathematical and statistical knowledge, without a priori knowledge of the field in which it will be applied. This is not a danger zone because the goal of machine learning is to model how a phenomenon works, not why it manifests itself. Sometimes it is not even possible to explain why a specific model gives correct results (this is, for example, the case with neural networks). A machine learning model is like a human being: it learns through examples. A child does not need to know Newton’s Law of Universal Gravitation to know that things fall. They just need to know that this happens in order to learn how to shoot a basketball. This is the right point to stress a key concept: more data does not imply more information, and more information does not imply more comprehension. Machine learning aims at extracting information, not at understanding it.
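The idea of learning through examples, without knowing the underlying law, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the observations are simulated, the names `simulate_fall_time` and `predict_fall_time` are hypothetical, and a 1-nearest-neighbour lookup stands in for whatever model one would actually use.

```python
import math

G = 9.81  # m/s^2; used ONLY to simulate observations, the learner never sees it


def simulate_fall_time(height_m):
    """Stand-in for reality: how long an object takes to fall from a height."""
    return math.sqrt(2 * height_m / G)


# Simulated "experience": (height in metres, observed fall time in seconds).
examples = [(h, simulate_fall_time(h)) for h in range(1, 21)]


def predict_fall_time(height_m, examples):
    """1-nearest-neighbour: reuse the time observed for the most similar height.

    The function knows nothing about gravity; it only looks at past examples.
    """
    nearest = min(examples, key=lambda ex: abs(ex[0] - height_m))
    return nearest[1]


guess = predict_fall_time(10.2, examples)   # learned from examples
truth = simulate_fall_time(10.2)            # what "really" happens
print(f"predicted {guess:.3f} s, actual {truth:.3f} s")
```

The predictor gives a usable answer for a height it has never observed, yet it contains no explanation of why objects fall: it captures the how, not the why.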
Finally, the intersection between all the macro-areas in Conway’s diagram is where Data Science lies. Data Science consists in understanding a problem, owning the mathematical tools for cracking it, and having the coding skills required for turning intentions into results. Therefore, data science is an interdisciplinary field.
Having understood what data science is, it is appropriate to connect it to business intelligence, which is a closely related hot topic. The easiest way to capture the difference between them is the following: whereas business intelligence aims at providing useful insights on known data, data science aims at framing those insights in a model that will be applied to unknown data. Therefore, we could express this relation with the following proportion:
Business Intelligence : Descriptive Analysis = Data Science : Predictive Analysis.
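The proportion can be made concrete with a small sketch. The sales figures are invented for illustration, the helper names (`describe`, `fit_line`, `predict`) are hypothetical, and a straight-line fit is only one of many possible predictive models.

```python
from statistics import mean

# Hypothetical monthly sales figures (illustrative numbers, not real data).
months = [1, 2, 3, 4, 5, 6]
sales = [100.0, 110.0, 120.0, 130.0, 140.0, 150.0]


def describe(values):
    """Descriptive analysis: summarize data we already have."""
    return {"mean": mean(values), "min": min(values), "max": max(values)}


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict(model, x):
    """Predictive analysis: apply the fitted model to data not seen before."""
    slope, intercept = model
    return slope * x + intercept


summary = describe(sales)          # insight on known data
model = fit_line(months, sales)    # the insight framed as a model
forecast = predict(model, 7)       # the model applied to an unknown month
print(summary, forecast)
```

In this sketch, `describe` plays the role of business intelligence, while `fit_line` and `predict` play the role of data science: the same data, but the result is a model that can be applied to a month that has not happened yet.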
Now, if you are interested in the topic but you are afraid of entering a new field, we find it fitting to quote Jake VanderPlas:
I would encourage you to think of data science not as a new domain of knowledge to learn, but as a new set of skills that you can apply within your current area of expertise.
- Python Data Science Handbook, 2016, O’Reilly
But always keep in mind that:
With great power comes great responsibility.
- Stan Lee