The data ecosystem explained in 10 simple images
A couple of weeks ago I was explaining how IoT and Big Data can be a game changer in sports and human activities. Yeah, it’s cool to put a sensor in your grill so that it can automatically tell you the exact time to flip your filet, but I think it’s even cooler to instrument the human body and objectively quantify human activities, especially in sports and healthcare, so that we can enjoy even more the sports we love and the life we live. After all, knowledge is power, right? And the first step to knowledge is the ability to measure: IoT is really a big deal here.
But then? From a business perspective, how and where will all the gathered data be stored? What do terms like Data Lake, Data Science and the like really mean? Surprisingly enough, the answer is as simple as describing the life cycle of something that everyone knows very well: orange juice.
I created a slide deck, named Data Juice, to explain the whole data ecosystem to C-level executives who had to understand the business value of every piece of that ecosystem without having to deal with technical details. It had to be clear, concise and easy to remember. The juice example was just perfect, as you’ll see.
Now, the same concept applies to IoT: the sensor working right now in the wearable you’re wearing almost surely collects raw data (acceleration, magnetic field, etc.) around 50 times per second (sometimes up to 100). But what you care about is not those raw values, but the trend of your performance in your preferred exercise.
What’s the line that connects those two points? Happily enough for me, answering this question also allows me to describe what kind of work I do and how it is related to sports, which was one of the questions I was asked recently by a well-known tech & lifestyle magazine.
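To make the gap between raw values and a trend a bit more concrete, here is a minimal Python sketch. It assumes a 50 Hz accelerometer that yields (x, y, z) tuples; the function names and the synthetic samples are made up for illustration and are not taken from any real device or library.

```python
import math

SAMPLE_RATE_HZ = 50  # assumed rate, based on the "around 50 times per second" figure


def magnitudes(samples):
    """Turn raw (x, y, z) acceleration samples into a single magnitude each."""
    return [math.sqrt(x * x + y * y + z * z) for x, y, z in samples]


def per_second_average(values, rate=SAMPLE_RATE_HZ):
    """Collapse `rate` raw values into one averaged value per second."""
    averages = []
    for start in range(0, len(values), rate):
        window = values[start:start + rate]
        averages.append(sum(window) / len(window))
    return averages


# Two seconds of synthetic data: one second at rest, one second of movement.
raw = [(0.0, 0.0, 1.0)] * 50 + [(3.0, 4.0, 0.0)] * 50
trend = per_second_average(magnitudes(raw))
```

One hundred raw readings become two numbers, one per second. That is much closer to what you would actually want to look at after a workout.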
Here’s the answer:
New data is generated every day, everywhere. Accounting systems, IoT devices, machine logs, clickstreams… anything generates some kind of data. At this stage data is usually generated and consumed locally and in very small sizes. For example, an accounting system generates an invoice and allows users to interact with it, one at a time. Just like oranges that grow on the tree: you can just grab one and consume it on the spot.
Oranges are harvested and sent to a factory for processing. The factory needs to be prepared to receive oranges of any size, kind and color, and at different stages of ripeness. Sometimes it will get only one load a day, sometimes several per hour. What happens with data is really similar. The famous 3Vs (Volume, Variety and Velocity) say that your data system must be able to accept almost any kind of data, in any amount and at any rate. Again, just like with oranges: if you want to be in the business, you had better be prepared to handle them when they arrive.
All the ingested data needs to be stored somewhere. It had better be a place where space can grow as needed, and it should also be pretty cheap if you don’t want to spend all your money storing everything, even when you don’t know if and how you will use that data in the future. Such a place also needs to support querying and processing capabilities, so that you can dive into the lake. You don’t really care too much about shape, format or correctness here. You just store data as it was generated, just to have it in case you need it, even if it may not be good for consumption.
Once also known as ETL (Extract-Transform-Load), this is the process where data is selected, cleansed, standardized and integrated with additional information, if available. This is more or less how oranges are cleaned, divided into different groups for different uses and, if not in good condition, discarded.
Once data has been processed by the previous system, it needs to be stored in a safe place. Safe not only in the sense that no one can steal from it, but also in the sense that everyone who gets data from it can assume it is safe to consume. Again, just like oranges: if you get one from a market warehouse, you can assume safely enough that you won’t be poisoned. The same goes for data. If you get data from the Data Warehouse, you can safely assume that it is safe to consume, meaning that you can use it to make informed decisions. The same does not apply to a Data Lake, where usually you don’t really know what you’ll get and how to deal with it.
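For readers who like to see things in code, the select, cleanse, standardize and enrich steps could look roughly like the Python sketch below. Everything in it is an assumption made for illustration: the invoice records, the field names and the customer-to-country lookup are hypothetical, and a real pipeline would use proper tooling rather than plain lists and dictionaries.

```python
# Hypothetical raw records, as they might sit in the lake.
RAW_INVOICES = [
    {"id": "1", "customer": " acme ", "amount": "100.50", "currency": "eur"},
    {"id": "2", "customer": "Globex", "amount": "n/a", "currency": "USD"},
    {"id": "3", "customer": "ACME", "amount": "20", "currency": "EUR"},
]

# Hypothetical reference data used to enrich each record.
CUSTOMER_COUNTRY = {"ACME": "IT", "GLOBEX": "US"}


def transform(record):
    """Cleanse and standardize one record; return None to discard it."""
    try:
        amount = float(record["amount"])
    except ValueError:
        return None  # the orange is not in good condition: discard it
    customer = record["customer"].strip().upper()
    return {
        "id": int(record["id"]),
        "customer": customer,
        "amount": amount,
        "currency": record["currency"].upper(),
        "country": CUSTOMER_COUNTRY.get(customer),  # integration with extra information
    }


def run_etl(raw_records):
    """Transform every raw record and keep only the ones that survived."""
    transformed = (transform(record) for record in raw_records)
    return [record for record in transformed if record is not None]


clean_invoices = run_etl(RAW_INVOICES)
```

The invoice with the unreadable amount is thrown away. The other two come out with consistent customer names, numeric amounts and a country attached.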
Once you get your oranges from the store, you can decide how to consume them. If you have hundreds of boxes you won’t consume them one by one. You’ll surely try to transform them into a product that is easier to consume. Depending on the target, you may want to process them in different ways, in order to make sure that everyone gets exactly what they need, no more and no less. A Data Mart is exactly this: data ready to be consumed with very little additional processing, if any is needed at all.
Of course, different people or different situations may require different aspects of that data, so having more than one Data Mart is quite common. If something is not clear enough in the Data Mart, one can always go to the Data Warehouse that lies behind it and check how and why such unexpected data has been produced.
Also known as Self-Service BI, this is where you’re not happy with the Data Marts: since none of them provides exactly what you need, you have to do it yourself. You need to squeeze the juice of three different oranges, of three different kinds, in order to create the perfect mixture you want. You go to the Data Warehouse (usually, but sometimes you can also grab data from different Data Marts) and create your mix, maybe adding a hint of cinnamon taken from the outside world.
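As a rough illustration of several Data Marts being built from the same Data Warehouse, here is a small Python sketch. The sales rows, the field names and the `build_mart` helper are all hypothetical; the point is only that each mart is a pre-aggregated view shaped for one audience.

```python
from collections import defaultdict

# Hypothetical, already-cleansed rows coming from the warehouse.
WAREHOUSE_SALES = [
    {"region": "EU", "product": "juice", "amount": 120.0},
    {"region": "EU", "product": "marmalade", "amount": 30.0},
    {"region": "US", "product": "juice", "amount": 80.0},
    {"region": "US", "product": "marmalade", "amount": 50.0},
]


def build_mart(rows, key):
    """Sum the sales amount per value of `key`, ready for direct consumption."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row["amount"]
    return dict(totals)


# Two marts over the same warehouse, each serving a different audience.
sales_by_region = build_mart(WAREHOUSE_SALES, "region")
sales_by_product = build_mart(WAREHOUSE_SALES, "product")
```

A regional manager would look at `sales_by_region`, a product owner at `sales_by_product`. If a total looks odd, the detailed rows are still in the warehouse to explain it.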
Using all the things described so far, you apply the Business Intelligence process to figure out how to drive your business, check how it is performing and analyze the outcome of defined strategies to stay in business and, possibly, be successful.
What to do with the vast amount of unused data in the Data Lake? Maybe among the oranges that do not fit the current sales plan and target, there is the next “big thing”: something you can leverage to create a new product or sell a new service, or something that can help you understand whether anything in the collect-store-process-enrich life cycle can be improved. Usually this means that you are looking for questions, not for answers. This is the main difference between the process of doing Business Intelligence and the process of Data Science: in the first you look for answers to questions you know, in the second you look for questions you don’t know yet.
In conclusion, this is what information technology, from a data standpoint, is all about: squeezing every drop of information out of all the data we have, so that we can gain knowledge that ultimately leads to wisdom.
So now, when someone asks me what kind of work I do, I say that “I squeeze information out of raw data”. Easy, understandable and, actually, really precise.