In a previous post we saw the common misconceptions in big data analytics, and how important it is to start looking at data with a goal in mind.
Even if I personally believe that posing the right question is 50% of what a good data scientist should do, there are alternative approaches that can be implemented. The main one that is often suggested, in particular by non-technical professionals, is the “let the data speak” approach: a sort of magic random data discovery that should spot valuable insights that a human analyst does not notice.
Well, the reality is that this is a highly inefficient method: (random) data mining is resource-consuming and potentially value-destructive. The main reason why data mining is often ineffective is that it is undertaken without any rationale, which leads to common mistakes such as false positives; over-fitting; neglected spurious relations; sampling biases; causation-correlation reversal; inclusion of the wrong variables; or, possibly, flawed model selection (Doornik and Hendry, 2015; Harford, 2014). We should pay particular attention to the causation-correlation problem, since observational data capture only correlation. However, according to Varian (2013), the problem can be easily solved through experimentation.
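To make the false-positive risk concrete, here is a minimal sketch of what happens when many unrelated variables are screened against a target without any rationale. Everything in it is an illustrative assumption rather than something taken from the cited works: the data are pure noise, the sample size (50) and the number of candidates (200) are arbitrary, and 0.279 is the approximate two-sided 5% critical value of the Pearson correlation for 50 observations.

```python
import math
import random


def pearson(x, y):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def count_spurious_hits(n_obs=50, n_candidates=200, critical_r=0.279, seed=42):
    """Count pure-noise variables that look 'significantly' correlated with a noise target."""
    rng = random.Random(seed)
    # The target and every candidate are independent noise: there is nothing to find.
    target = [rng.gauss(0, 1) for _ in range(n_obs)]
    hits = 0
    for _ in range(n_candidates):
        candidate = [rng.gauss(0, 1) for _ in range(n_obs)]
        if abs(pearson(target, candidate)) > critical_r:
            hits += 1
    return hits


hits = count_spurious_hits()
print(f"{hits} of 200 pure-noise variables look 'significant' at the 5% level")
```

With a 5% threshold, roughly one candidate in twenty should pass the screen by chance alone, so an undirected search over hundreds of variables will almost always "discover" something.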
Hence, I think that a hybrid approach is necessary. An intelligent data discovery process and exploratory analysis are valuable at the beginning to correctly frame the questions (“we don’t know what we don’t know” - Carter, 2011). Then, the question has to be addressed from several perspectives and using different methods, which may sometimes even lead to unexpected conclusions.
More formally, and in a similar fashion to Doornik and Hendry (2015), I think there are a few relevant steps for analyzing the relationships in huge datasets. The problem formulation, obtained by leveraging theoretical and practical considerations, tries to spot which relationships deserve further investigation. The identification step instead tries to include all the relevant variables and the effects to be accounted for, through both the (strictest) statistical methods and non-quantitative criteria, and verifies the quality and validity of the available data. In the analytical step, all the possible models have to be dynamically and consistently tested with unbiased procedures, and the insights reached through the data interpretation have to be fed back to improve (and maybe redesign) the problem formulation (Hendry and Doornik, 2014).
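As a rough illustration of the analytical step, the sketch below compares two candidate models on data that were not used to fit them. It is not the procedure of Doornik and Hendry: a simple hold-out split is swapped in here as the most basic example of an unbiased comparison. The synthetic data, the two candidate models, and all function names are hypothetical.

```python
import random


def fit_mean(xs, ys):
    """Baseline candidate: always predict the training mean of y."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Candidate with one explanatory variable, fitted by ordinary least squares."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def mse(model, xs, ys):
    """Mean squared prediction error of a fitted model on the given data."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Synthetic data (an assumption for the example): y depends linearly on x plus noise.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [2.0 * x + rng.gauss(0, 1) for x in xs]

# Fit on the first 150 observations, evaluate on the 50 held out.
train_x, test_x = xs[:150], xs[150:]
train_y, test_y = ys[:150], ys[150:]

candidates = {"mean only": fit_mean, "linear": fit_line}
scores = {name: mse(fit(train_x, train_y), test_x, test_y)
          for name, fit in candidates.items()}
best = min(scores, key=scores.get)
print(scores, "->", best)
```

The point of the design is that every candidate is scored by the same rule on the same unseen observations, so the comparison does not reward a model merely for fitting the training sample more closely.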
Those aspects can be incorporated into a lean approach, in order to reduce the time, effort, and costs associated with data collection, analysis, technological improvements, and ex-post measuring. The relevance of the framework lies in avoiding the two opposite extremes, namely collecting all the data or none at all. The next figure illustrates key steps towards this lean approach to big data: first of all, business processes have to be identified, as well as the analytical framework that should be used.
These two consecutive stages (business process definition and analytical framework identification) have a feedback loop, and the same is also true for the analytical framework identification and the dataset construction. This phase has to consider all the types of data, namely data at rest (static and inactively stored in a database); in motion (transiently stored in temporary memory); and in use (constantly updated and stored in a database).
The modeling step embeds the validation as well, while the process ends with the scalability implementation and the measurement. A feedback mechanism should prevent internal stasis, feeding the business process with the outcomes of the analysis instead of continuously improving the model without any business response.
This approach is important because it highlights a basic aspect of big data innovation. Even if big data analytics is implemented with the idea of reducing world complexity, it actually provides multiple solutions to the same problem, and some of these solutions force us to rethink the question we posed in the first place.
All these considerations are valid both for personal projects and for companies’ data management. Working in a corporate context also requires further precautions, such as the creation of a solid internal data analytics procedure.
Data need to be consistently aggregated from different sources of information, and integrated with other systems and platforms; common reporting standards should be created - the so-called master copy - and all information should be validated for accuracy and completeness. Having solid internal data management, together with a well-designed golden record, helps to solve the huge issue of stratified entrance: dysfunctional datasets resulting from different people augmenting the dataset at different moments or across different layers.
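A golden record can be sketched in a few lines. The example below is only one possible consolidation rule, chosen for illustration: for each field it keeps the most recently updated non-empty value, and then checks completeness against a required-field list. The source systems (`crm`, `billing`), the field names, and the `REQUIRED_FIELDS` standard are all hypothetical.

```python
from datetime import date

# Hypothetical reporting standard: fields every golden record must contain.
REQUIRED_FIELDS = ("name", "email", "country")


def build_golden_record(records):
    """Merge records describing one entity.

    Records are applied from oldest to newest, so for each field the most
    recently updated non-empty value wins; empty values never overwrite data.
    """
    golden = {}
    for record in sorted(records, key=lambda r: r["updated"]):
        for field, value in record.items():
            if field != "updated" and value not in (None, ""):
                golden[field] = value
    return golden


def missing_fields(golden):
    """Completeness check: required fields the golden record still lacks."""
    return [field for field in REQUIRED_FIELDS if field not in golden]


# Two sources describing the same customer, entered at different moments.
crm = {"id": 17, "name": "Ada Rossi", "email": "", "updated": date(2015, 3, 1)}
billing = {"id": 17, "name": "A. Rossi", "email": "ada@example.com",
           "updated": date(2014, 11, 5)}

golden = build_golden_record([crm, billing])
print(golden)
print("missing:", missing_fields(golden))
```

Here the newer CRM entry supplies the name, the older billing entry fills the email the CRM left blank, and the completeness check flags that no source provides a country. In practice the survivorship rule would depend on how much each source is trusted, not only on recency.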
The ideas presented here are not one-size-fits-all solutions and should be carefully adapted to different situations, teams, and companies, but they are in my opinion a good starting point for thinking about big data processes.
Carter, P. (2011). “Big data analytics: Future architectures, Skills and roadmaps for the CIO”. IDC White Paper. Retrieved from http://www.sas.com/resources/asset/BigDataAnalytics-FutureArchitectures-Skills-RoadmapsfortheCIO.pdf.
Doornik, J. A., & Hendry, D. F. (2015). “Statistical model selection with big data”. Cogent Economics & Finance, 3, 1045216.
Harford, T. (2014). “Big data: Are we making a big mistake?” Financial Times. Retrieved from http://www.ft.com/cms/s/2/21a6e7d8-b479-11e3-a09a-00144feabdc0.html#ixzz2xcdlP1zZ.
Hendry, D. F., & Doornik, J. A. (2014). Empirical model discovery and theory evaluation. Cambridge, Mass.: MIT Press.
Varian, H. (2013). “Beyond big data”. NABE annual meeting. San Francisco, CA, September 10th, 2013.
Note: the above is an adapted excerpt from the forthcoming book “Big Data Analytics: A Management Perspective” (Springer, 2016).