Data science and Big Data: definition and common myths
I. First thoughts and definitions
Big data is one of the most common buzzwords today. There are many ways to define it, which is probably why it remains such a difficult concept to grasp.
Some describe big data as a dataset bigger than a certain threshold, e.g., over a terabyte (Driscoll, 2010), while others see it as a dataset that crashes conventional analytical tools like Microsoft Excel. More renowned works identify big data as data that display the features of Variety, Velocity, and Volume (Laney, 2001; McAfee and Brynjolfsson, 2012; IBM, 2013; Marr, 2015). All of these definitions are true to some extent, although I think they are incomplete.
The first class of definitions is partial, since it refers to a purely technological issue, i.e., the computational need exceeds the analytical power of a single tool or machine. This does not explain, however, why big data emerged only a few years ago and not back in the Nineties.
The second opinion is instead too constraining, since it assumes that all the features have to be satisfied to talk about big data, and it also seems to identify the causes that originated big data (i.e., a huge amount of fast and diverse new data sources), rather than its characterization.
There are also many other definitions that could be used (Dumbill, 2013; De Mauro et al., 2015), but my personal definition is the following: data science is an innovative approach that consists of different new technologies and processes to extract worthy insights from low-value data that do not fit, for any reason, the conventional database systems (i.e., big data).
II. Data Misconceptions
Data are quickly becoming a new form of capital, a different coin, and an innovative source of value. It is extremely important to learn how to channel the power of big data into an efficient strategy to manage and grow a business. A well-set data strategy is becoming fundamental to every business, regardless of the actual size of the datasets used. However, in order to establish a data framework that works, there are a few misconceptions that need to be clarified:
i) More data means higher accuracy. Not all data are good quality data, and tainting a dataset with dirty data could compromise the final products. It is similar to a blood transfusion: if a non-compatible blood type is used, the outcome can be catastrophic for the whole body.
Secondly, there is always the risk of overfitting the model to the data without deriving any further insight: “if you torture the data enough, nature will always confess” (Coase, 2012). In all applications of big data, you want to avoid striving for perfection: too many variables increase the complexity of the model without necessarily increasing accuracy or efficiency. More data always implies higher costs and not necessarily higher accuracy. These costs include higher maintenance costs, both for physical storage and for model retention; greater difficulty in making decisions and interpreting the results; and more burdensome data collection and time opportunity costs.
The data used certainly do not have to be conventional or used in a standard way (this is where the real gain is locked in), and they may challenge conventional wisdom, but they have to be proven and validated. In summary, smart data strategies always start by analyzing internal datasets before integrating them with public or external sources.
Do not store and process data just for the sake of having them because, with the amount of data being generated daily, the noise increases faster than the signal (Silver, 2013). Pareto’s 80/20 rule applies: 80% of the phenomenon can probably be explained by 20% of the data owned.
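The overfitting risk described under this misconception can be made concrete with a small simulation. The sketch below is illustrative only and is not taken from the sources cited in this article: the linear signal, the noise level, and the helper names (fit_line, fit_interpolating_polynomial, compare_models) are all assumptions chosen for the example. It fits the same noisy points with a simple line and with a polynomial flexible enough to pass through every point, then compares the errors on the training points and on held-out points.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_interpolating_polynomial(xs, ys):
    """Lagrange polynomial of degree len(xs) - 1 through every training point."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            weight = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    weight *= (x - xj) / (xi - xj)
            total += yi * weight
        return total
    return predict


def mse(model, xs, ys):
    """Mean squared error of a model on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def true_signal(x):
    # Hypothetical ground truth, assumed linear for this illustration.
    return 1.0 + 2.0 * x


def compare_models(trials=200, n_points=10, noise_sd=0.2, seed=0):
    """Average train/test errors of both models over many noisy samples."""
    rng = random.Random(seed)
    train_x = [i / (n_points - 1) for i in range(n_points)]
    # Held-out points sit halfway between neighbouring training points;
    # their targets are the noise-free signal, i.e. the "truth".
    test_x = [(a + b) / 2 for a, b in zip(train_x, train_x[1:])]
    test_y = [true_signal(x) for x in test_x]
    totals = {"line_train": 0.0, "line_test": 0.0,
              "poly_train": 0.0, "poly_test": 0.0}
    for _ in range(trials):
        train_y = [true_signal(x) + rng.gauss(0.0, noise_sd) for x in train_x]
        line = fit_line(train_x, train_y)
        poly = fit_interpolating_polynomial(train_x, train_y)
        totals["line_train"] += mse(line, train_x, train_y)
        totals["line_test"] += mse(line, test_x, test_y)
        totals["poly_train"] += mse(poly, train_x, train_y)
        totals["poly_test"] += mse(poly, test_x, test_y)
    return {name: value / trials for name, value in totals.items()}


if __name__ == "__main__":
    for name, value in compare_models().items():
        print(f"{name}: {value:.4f}")
```

Under these assumptions, the flexible model reproduces its training data perfectly, yet on average it should do worse than the straight line on the held-out points: the extra complexity is spent fitting noise rather than signal.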
ii) If you want to do big data, you have to start big. A good practice before investing heavily in technology and infrastructure for big data is to start with a few high-value problems that validate whether big data may be of any value to your organization. Once the proof of concept demonstrates the impact of big data, the process can be scaled up.
iii) Data equals Objectivity. The interpretation of data is the quintessence of its value to business. Ultimately, different types of data could provide different insights to different observers due to relative problem frameworks or interpretation abilities (i.e., the framing effect).
Let’s also not forget that people are affected by a wide range of behavioral biases that may invalidate the objectivity of the analysis. The most common ones among both scientists and managers are: apophenia (finding patterns where there are no patterns at all), narrative fallacy (the need to fit a pattern to a series of disconnected facts), confirmation bias (the tendency to use only information that confirms some priors, with the corollary that the search for evidence will eventually end up with evidence discovery), and selection bias (the propensity to always use the same type of data, possibly those that are best known).
A final interesting big data curse worth pointing out is becoming known as “Hathaway’s effect”: it appeared that when the famous actress Anne Hathaway was mentioned positively in the news, the stock price of Warren Buffett’s Berkshire Hathaway increased. This suggests that some correlations are spurious, or completely meaningless and groundless.
iv) Your data will reveal the whole truth to you. Data on their own are meaningless if you do not pose the right questions first. Readapting what Deep Thought says in The Hitchhiker’s Guide to the Galaxy, big data can provide the final answer to life, the universe, and everything, as soon as the right question is asked. This is where human judgment comes in: posing the right question and interpreting the results are still competencies of the human brain.
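One way to see how groundless correlations arise is to search through enough unrelated series. The sketch below is a toy simulation, not a reconstruction of the Hathaway case: every series is pure random noise, and the function names, sample sizes, and seed are assumptions made for illustration. It looks for the candidate series that best correlates with a target over a search window, then checks the same pair on a held-out window.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def best_spurious_match(n_candidates=1000, n_obs=20, seed=0):
    """Return (search-window r, held-out r) for the best-matching noise series."""
    rng = random.Random(seed)

    def noise(length):
        return [rng.gauss(0.0, 1.0) for _ in range(length)]

    # Each series has 2 * n_obs points: the first half is used for the
    # search, the second half is held out to check the "discovery".
    target = noise(2 * n_obs)
    best_in, best_out = 0.0, 0.0
    for _ in range(n_candidates):
        candidate = noise(2 * n_obs)
        r_in = pearson(target[:n_obs], candidate[:n_obs])
        if abs(r_in) > abs(best_in):
            best_in = r_in
            best_out = pearson(target[n_obs:], candidate[n_obs:])
    return best_in, best_out


if __name__ == "__main__":
    r_search, r_holdout = best_spurious_match()
    print(f"best correlation found in the search window: {r_search:.2f}")
    print(f"same pair on the held-out window: {r_holdout:.2f}")
```

Because the candidates are independent of the target by construction, any strong correlation found in the search window is produced by the search itself. The held-out window is a simple way to test whether such a finding has any substance.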
Coase, R. H. (2012). Essays on economics and economists. University of Chicago Press.
De Mauro, A., Greco, M., & Grimaldi, M. (2015). “What is big data? A consensual definition and a review of key research topics”. AIP Conference Proceedings, 1644, 97–104.
Driscoll, M. E. (2010). “How much data is big data?” [Msg 2]. Message posted to. Retrieved from https://www.quora.com/How-much-data-is-Big-Data.
Dumbill, E. (2013). “Making sense of big data”. Big Data, 1(1), 1–2.
IBM. (2013). “The Four V’s of Big Data”. Retrieved from http://www.ibmbigdatahub.com/infographic/four-vs-big-data.
Laney, D. (2001). “3D Data Management: Controlling Data Volume, Velocity, and Variety”. META Group Inc. Retrieved October 27, 2015 from http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf.
Marr, B. (2015). Big data: using smart big data, analytics and metrics to make better decisions and improve performance, (p. 256). Wiley.
McAfee, A., & Brynjolfsson, E. (2012). “Big data: the management revolution”. Harvard Business Review, 90(10), 60–68.
Silver, N. (2013). The Signal and the Noise: The Art and Science of Prediction. Penguin.
Note: a first version of this article appeared in the Science to Data Science (S2DS) program blog, and part of the material is taken from the forthcoming book “Big Data Analytics: A Management Perspective” (Springer, 2016).