How Much Do You Know About Your Data And Is Your Product Ready To Benefit From Data Science?
This is the second part of a 3-part article on the four hurdles to creating value from data. In the first part, we gave an overview of the four hurdles. We posited that the main challenges in extracting value from investments in all things data boil down to who (and to a lesser extent what) businesses invest in and the order in which those investments happen. In this second part, we will discuss in detail the first two of the four hurdles that organisations need to overcome in order to get data working for them:
- Lack of appreciation of the different types of data and what each entails
- Products, data assets and budgets that are not ready
There is more to data than you might think
It is often a good idea to develop a deeper understanding of the things you have to deal with before you actually start dealing with them, and data are definitely one of those things. Did you know that there are many types of data out there? By type, I do not mean things like int, float and char in Computer Programming 101, or categorical, ordinal, etc. in Statistics. Instead, we are discussing data in the context of their roles in building data solutions at the enterprise level. The number of types can be as small or as large as you like, depending on who you ask. It is one of those things that even the people with skin in the game cannot agree on. Some say there are 13 types (an unlucky number, yes I know) while others posit that there are 7 (later revised to 11). The discussion here is not so much about the exact number as it is about what the types stand for, which has implications for how they are managed and used, and for the ensuing challenges. I have always related rather well to the distinction between “things” and “activities”, which has also been covered elsewhere. In other words, for the purposes of this discussion there are only two types of data: data about things and data about their activities.
Data about things can be referred to in a few ways. The most widely known term is master data, from the master data management discipline. This type of data represents the key entities or objects that underpin the business’ operations, e.g., customers, products, users, reviews, places. It also covers reference data, which are further data used to classify and categorise the things with which they are associated. Reference data can exist as predefined flat lists such as product types (e.g., free, premium), or they can be richer, in the form of a taxonomy or ontology. For instance, each user in the master data record can have a job title or topics of interest, which further contextualise or describe the entities beyond the four walls of the enterprise. This type of reference data gives a more in-depth and advanced organisation and understanding of the master data than would otherwise be possible. Take the example of one subset of users who have “iOS Developer” as their job title and another subset with “Android Developer”. Through rich reference data in the form of a job title ontology, we can infer that these two groups of users are related through a common parent, “Mobile App Developer”.
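To make the job title example concrete, here is a minimal sketch in Python. It assumes the ontology is a simple child-to-parent mapping. The titles beyond the three mentioned above, the user ids and the function names are all hypothetical and only there for illustration. A real ontology would be far richer than this.

```python
# Hypothetical reference data: a tiny job-title taxonomy stored as child -> parent.
JOB_TITLE_PARENT = {
    "iOS Developer": "Mobile App Developer",
    "Android Developer": "Mobile App Developer",
    "Mobile App Developer": "Software Developer",
    "Backend Developer": "Software Developer",
}

# Hypothetical master data: user id -> job title.
USERS = {
    "u1": "iOS Developer",
    "u2": "Android Developer",
    "u3": "Backend Developer",
}


def ancestors(title):
    """Return the chain of parents above a title, nearest first."""
    chain = []
    while title in JOB_TITLE_PARENT:
        title = JOB_TITLE_PARENT[title]
        chain.append(title)
    return chain


def closest_common_parent(title_a, title_b):
    """Return the nearest ancestor shared by two titles, or None if there is none."""
    seen = set(ancestors(title_a))
    for candidate in ancestors(title_b):
        if candidate in seen:
            return candidate
    return None


# Users u1 and u2 have different job titles, yet the taxonomy relates them.
shared = closest_common_parent(USERS["u1"], USERS["u2"])
print(shared)
```

With a flat list of job titles, u1 and u2 would simply look different. The parent links are what allow the two groups to be treated as related.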
The other type of data is data about the activities of the things. In more mainstream terms, these cover the more traditional transaction data, e.g., this customer buys that product, as well as more modern ones such as this user likes that review or this user is in this place, where customers, products, users, reviews and places are all things. The term “big data” is typically reserved for the latter, which take the form of continuous streams of events from large online marketplaces, products and services that exist purely online, and internet of things devices. As one would expect, both traditional and big activity data contain timestamps, some numeric values to qualify the transactions, and references, in the form of unique identifiers, to the things involved in the activities. For the purpose of this discussion, excuse my casual use of “small” data to refer to everything else besides big data.
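The common shape of activity data described above (a timestamp, references to the things involved, and an optional numeric value) could be sketched as a small record type. The class and field names below are assumptions for illustration, not a standard schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ActivityEvent:
    """One activity: who did what to which thing, and when (hypothetical schema)."""
    occurred_at: datetime          # timestamp of the activity
    actor_id: str                  # unique identifier of the thing acting
    action: str                    # e.g. "buys", "likes", "is_in"
    object_id: str                 # unique identifier of the thing acted upon
    value: Optional[float] = None  # numeric qualifier, e.g. an amount paid


# Traditional transaction data: this customer buys that product.
purchase = ActivityEvent(
    occurred_at=datetime(2020, 1, 15, 9, 30, tzinfo=timezone.utc),
    actor_id="customer-42",
    action="buys",
    object_id="product-7",
    value=19.99,
)

# More modern activity data: this user likes that review (no numeric value).
like = ActivityEvent(
    occurred_at=datetime(2020, 1, 15, 9, 31, tzinfo=timezone.utc),
    actor_id="user-3",
    action="likes",
    object_id="review-88",
)
```

Note that the event carries only identifiers for the customer, product, user and review. The descriptive detail about those things lives in the data about things.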
While most businesses are rather familiar with, and capable of, managing their data about things and the more traditional data about activities, the challenge is often appreciating the drastically different mindset and infrastructure needed to cater for big data. This is where most organisations that struggle to get value from their investment in big data trip up. For such businesses, the race ends even before it begins. Many of the opportunities for innovation using data rest with big data that are thoughtfully fused with everything else the organisation already has. I have seen organisations that expect value from big data before they have even invested in the right technology and, more importantly, before they have brought big data together with small data in a way that can be worked on. Please do not fall into that trap. Why?
Firstly, big data need an entirely different set of tools compared to when you only have to manage small data, from the point where they are captured and stored through to the different processing steps that make them ready for use by data professionals. There is a whole raft of concepts and technologies that you would need to wrap your head around, such as Kafka for messaging between large numbers of data producers and consumers; S3, HDFS and Cassandra for cloud or distributed storage of large volumes of structured or unstructured data; and Spark and MapReduce for accessing and processing the data, amongst others.
Secondly, having specialised infrastructure to deal with big data is not enough if you have not designed and built it with consideration for how big and small data would interact. Both are critical to building the “data moat” for your business. Treating them separately could introduce serious gaps in the potential of the data that your data professionals are trying to work with. As an example, say you have 5 million events flowing through every day about users interacting with pet advertisements on your marketplace. These are activity data in huge volume. If your data infrastructure does not help resolve the users and the pets back to who they really are using your data about things, then your big data are as useful as a spade is for a drilling task. This is why concepts such as the transformation and enrichment of your big data, supplemented with small data, are critical.
Thirdly, we need to give small data much-needed attention, as they are often overlooked amid the big data hype. As discussed earlier, there is more to small data than just the usual data about the explicit aspects of things, such as the names, email addresses and phone numbers of customers and users. More and more data scientists who attempt to build smarter AI solutions are beginning to realise that the power of big data can be greatly amplified with small data, more specifically the taxonomies and ontologies that supplement otherwise flat lists of reference data. For instance, knowing that a user is an Android developer is useful, but what would be 10x more valuable is knowing how one subset of developers is connected to another subset of users through the fact that they are all mobile app developers. This piece of knowledge would only come through ontological assets.
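The enrichment step mentioned above can be sketched roughly as follows. This is a toy, in-memory illustration: the records, field names and the `enrich` function are all hypothetical, and at 5 million events a day the same lookup would be done in a distributed processing layer rather than in a Python list.

```python
# Hypothetical "small" data about things, keyed by unique identifier.
USERS = {
    "u1": {"name": "Alice", "job_title": "iOS Developer"},
}
PETS = {
    "p9": {"species": "dog", "breed": "Kelpie"},
}

# Hypothetical raw activity events: identifiers only, no context.
EVENTS = [
    {"ts": "2020-01-15T09:30:00Z", "user_id": "u1", "pet_id": "p9", "action": "view_ad"},
    {"ts": "2020-01-15T09:31:00Z", "user_id": "u2", "pet_id": "p9", "action": "view_ad"},
]


def enrich(event, users, pets):
    """Attach the matching user and pet records to a raw event.

    Unknown identifiers resolve to None, and the event is flagged so that
    gaps between big and small data are visible rather than silent.
    """
    enriched = dict(event)
    enriched["user"] = users.get(event["user_id"])
    enriched["pet"] = pets.get(event["pet_id"])
    enriched["resolved"] = enriched["user"] is not None and enriched["pet"] is not None
    return enriched


enriched_events = [enrich(e, USERS, PETS) for e in EVENTS]
unresolved = [e for e in enriched_events if not e["resolved"]]
print(len(enriched_events), "events,", len(unresolved), "could not be resolved")
```

The second event refers to a user that the master data does not know about. Counting such unresolved events is one simple way to see how well the two kinds of data have been brought together.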
The availability of open or proprietary carefully curated and managed data assets enables better intent recognition and better understanding of the semantics of content for semantic search. [src]
In short, before you venture any further into investing in data technology and infrastructure, let alone building out teams of data professionals and expecting products to improve instantaneously, awareness of the different types of data, what they stand for and how they should be managed is critical. If you are not data savvy, make sure you have someone alongside you to advise you, someone who is empowered to strategise and make the calls on data investment.
Are your products, data assets and budgets ready?
Having the biggest big data in the world will not fix a broken product or one whose market fit you do not quite understand. This is especially true if you do not yet have the infrastructure to help you make the raw data useful, or are not tracking the right data. It goes without saying that data professionals such as analysts and scientists need meaningful data in usable forms to work with. In the previous section, we discussed at length the more nuanced aspects of data and the need to factor them into any investment decisions on technology and tooling. In addition to the data infrastructure aspects of readiness, the state that your product is in is also important. Before building big teams of data professionals and investing too much in the relevant data technology, you have to make sure that your product managers are doing their job. Let us dissect what I mean by that.
Data science requires data to science, and most companies don’t have much data on day one. [src]
Assume you are the product manager of the search product at Company Pets4U. Your job is to make sure that there is a high volume of activities and transactions through the search engine. The product was built without much user research or any understanding of the users’ jobs to be done. However, what you do know is that data are important, so you plug in something that you are familiar with, say Google Analytics. It provides you with metrics about how many visitors you get, how many searches each performs, the top keywords, perhaps even the conversion from clicks on the results to successful pet adoptions. You then get some dashboards to monitor how your search product is going, and the numbers tell you that the search volume is low: less than a tenth of your user base performs searches, those who do search do so infrequently, and the conversion is abysmal. You are alarmed and think about getting some data science help to make your search better. You put forward a case, get a junior data scientist and task the poor fellow with fixing the problems. Please do not fall into that trap. Why?
Firstly, as the product manager for the search engine, you have to ask yourself this: are users using it the way you intended, to solve whatever problems they are trying to solve? Many things can cause poor levels of user activity and conversion, and a number of them cannot be solved by data science. If the product that you have built does not quite help the users with what they need to do, your tracked data are unlikely to tell you that, at least not without you trying to connect dots that do not exist. You need to first go back to the basics of user research and other ways of validation that could involve data, but not the kind that data science deals with. It could be the case that the users do not need a search engine. Instead, they need a recommendation product that emails them suggestions of pets for adoption. In other words, you could be giving the users a shovel when they need a drill. If such is the case, then you may be doing everything right by the data tracking and infrastructure book, but the usefulness of the data is questionable. This is the case of the product not being ready for data scientists to work on.
Secondly, even if we assume that the poor adoption of your product or the dissatisfaction of your users is caused by things that can be solved with data science, in the scenario above it is unlikely that your product has collected enough data, or data in a form that is suitable for data science solutions. Off-the-shelf tracking solutions are often designed and marketed for product analytics purposes. What this means is that you are unlikely to get the level of granularity in the data that data scientists need to produce machine learning models, etc. In this case, you might know that your product is not performing well, but your data scientists would not have the data prerequisites to dissect the problem further and actually solve it. All the discussions in the previous section about the different types of data and how they need to be considered in the choice of technology and infrastructure have to be put into practice here. The inability to do that well will greatly hamper the readiness of the data assets that get produced for data professionals to work on. In addition, it is often a good idea to involve the people who will actually be working with the data early on to define the requirements.
Thirdly, on the point of budget, I know that a junior data scientist costs your business much less than, say, a senior or lead level person. But as the saying goes, you get what you pay for. The issue here is not so much the technical abilities of the junior person filling the role. Rather, it is the person’s ability to find the solution that is just right for the problem you are trying to solve, given all the constraints that you face, without going overboard or entirely missing the mark. It takes years of grooming and practice in the right environments to build up that experience. The problem is especially bad if the junior data scientists are expected to work independently with very little guidance. My usual suggestion is that if you are starting out, hiring a few experienced people is better than hiring many juniors. In addition to being able to work on your problems with the right solutions more independently, the more senior people are also more likely to be “full stack”. In other words, they have a wider range of experience and knowledge, and are able to perform a greater range of tasks that traditionally would not fall into the realm of data science, such as data engineering, deployment of models, etc. Do not get me wrong, junior data scientists are great, but almost always only after you have a team of more experienced data professionals.
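To illustrate the granularity gap described above, here is a rough sketch contrasting the aggregate numbers a dashboard shows with the event-level rows a model would need. The log format, the field names and the figures are all made up for the Pets4U scenario and do not reflect what any particular tracking tool exports.

```python
# Hypothetical event-level search log: one record per search performed.
SEARCH_LOG = [
    {"user_id": "u1", "query": "kelpie puppy", "results": ["p1", "p2", "p3"],
     "clicked": "p2", "adopted": True},
    {"user_id": "u1", "query": "cat", "results": ["p4", "p5"],
     "clicked": None, "adopted": False},
    {"user_id": "u2", "query": "rabbit", "results": ["p6"],
     "clicked": "p6", "adopted": False},
]
TOTAL_USERS = 10  # hypothetical size of the user base


def dashboard_metrics(log, total_users):
    """Aggregate view: enough to see that something is wrong, not why."""
    searchers = {row["user_id"] for row in log}
    clicks = [row for row in log if row["clicked"] is not None]
    adoptions = [row for row in clicks if row["adopted"]]
    return {
        "searches": len(log),
        "share_of_users_searching": len(searchers) / total_users,
        "click_to_adoption_rate": len(adoptions) / len(clicks) if clicks else 0.0,
    }


def training_rows(log):
    """Granular view: one labelled row per result shown for each query."""
    rows = []
    for row in log:
        for position, pet_id in enumerate(row["results"], start=1):
            rows.append({
                "query": row["query"],
                "pet_id": pet_id,
                "position": position,
                "clicked": pet_id == row["clicked"],
            })
    return rows


metrics = dashboard_metrics(SEARCH_LOG, TOTAL_USERS)
rows = training_rows(SEARCH_LOG)
print(metrics)
print(len(rows), "labelled rows for a ranking model")
```

The aggregates can always be derived from the event-level log, but the labelled rows cannot be recovered from the aggregates. If only the aggregates were ever tracked, a data scientist has very little to work with.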
In short, it is perfectly OK for your organisation and your products not to have embraced data yet to solve your problems if you are not ready, for the reasons discussed above. As they say, do not put the cart before the horse. It is perfectly OK to spend the early days focusing on the strategy and the market, and validating your products with small data and verbatim feedback. It is also likely that, once you have found a spot in the market for what you are doing, you will initially find yourself doing a lot of grunt work to prepare whatever data are needed to power your new products, and to collect the right data for the future. Data professionals cannot work with data that do not exist. At the same time, once your products and data assets are ready, make sure you hire right. If your business is in its early stages and you can afford it, start with more experienced individuals to work on the data to maximise the chances of getting the outcomes you want.
The last part of the 3-part article is:
Are You Able To Match The Right Data Science Solutions To Problems And Retain Your Data People?
In the first and second part of this 3-part article, we covered the importance of knowledge about the different types…
If you need to refer back to the first part, it is here:
Four Hurdles To Creating Value From Data
Data provide organisations the new platform in the 21st century and beyond for innovation of products and processes —…