The Data Life Cycle

Gressi Lopez
9 min read · Dec 16, 2021


Data, in a business context, can be thought of as individual facts that are collected and then analyzed to answer a given business question. The process sounds simple enough, but in reality there are several involved stages to making sense of data. This article simplifies the process, but note that there are alternatives to the actions described at every stage.

Raw Data

Any detail or fact we can observe can be collected and labeled as data. Often, the trick is understanding what you actually want to collect. We are going to use the simple example of a coffee shop. This coffee shop has several streams of data it can collect. Daily transactions can be logged from its point-of-sale system, and if it has a loyalty program, it can collect customer information ranging from basic demographics to each customer's most purchased drink. The coffee shop might also have social media accounts, whose likes and comments we can glean insights from.

Let's analyze the social media stream of data a little further, because it gives us a great example of two different types of data: structured vs. unstructured. Certain metrics already built into our social media account provide us with structured data. These include things like the number of views on a video, the number of likes on a post, or the number of comments. This data can be collected in a structured way: we can track the number of comments per post by creating a column in a spreadsheet and entering the data for each post into a row.
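
To make that concrete, here is a minimal sketch of capturing those metrics as structured rows and columns in Python. The field names and numbers are purely illustrative, not pulled from any real social media API.

```python
import csv

# Each post becomes one row; each metric becomes one column.
# The field names and values below are illustrative only.
posts = [
    {"post_id": 1, "views": 5200, "likes": 340, "comments": 58},
    {"post_id": 2, "views": 1800, "likes": 95, "comments": 12},
]

with open("social_metrics.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["post_id", "views", "likes", "comments"])
    writer.writeheader()
    writer.writerows(posts)
```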

What if we deem this data to be irrelevant? (Marketers often do, labeling it a vanity metric.) We want to look instead at what the comments say! But a given post averages more than 1,000 comments, and sometimes they don't make sense. This is unstructured data. There is no predefined structure for comments: when a user makes a comment, there is no prompt saying "you must use 10 words and one adjective per sentence, and answer within two sentences." Comments are irregular and not easily transformed into quantifiable data. There are, however, methods that use cloud computing, AI tools, and machine learning to extract and transform this unstructured data into something analyzable. For the sake of simplicity, this article will focus solely on the life cycle of structured data.

The Database

We have now established the sources of data within our coffee shop scenario and decided to focus on structured data. The task now is to collect all this information in one consolidated place for us to look over. If only there were such a thing capable of doing that… (*cough) a database, dude!

A database does just that for us! It is an organized collection of data that we are storing. In our coffee shop example, we can add our data to a database and perform CRUD operations, meaning we use CREATE, READ, UPDATE, and DELETE functions within the database to facilitate the proper collection and management of this data. This environment is typically referred to as the online transaction processing (OLTP) system. Part of the reason it has earned that title is that the database collects, for the most part, transactional information.

Once collected, this data can be organized using the CRUD functions. Tables are created to store information based on entities (i.e., the subject of the table) and their respective attributes, and tables are then linked together based on a commonality, typically depicted by a relationship link. There are a lot of nuances to the relational aspect of tables within a database, but for the purposes of this article, the big takeaway is that transactional information can be stored in tables that are linked to one another. From there, a user (often a database administrator) can run simple queries to read data from the database or perform small analyses. This data is periodically updated, depending on the company's requirements, and then deleted when deemed no longer necessary.
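
As a rough sketch of what those CRUD operations might look like for the coffee shop, here is a small Python example using SQLite. The table and column names (transactions, drink, sold_at, and so on) are assumptions made for illustration; a real point-of-sale system would likely sit on a server database.

```python
import sqlite3

conn = sqlite3.connect("coffee_shop.db")
cur = conn.cursor()

# CREATE: a table for the transaction entity and its attributes.
cur.execute("""
    CREATE TABLE IF NOT EXISTS transactions (
        transaction_id INTEGER PRIMARY KEY,
        customer_id    INTEGER,
        drink          TEXT,
        price          REAL,
        sold_at        TEXT
    )
""")

# CREATE: insert a new row as a sale happens.
cur.execute(
    "INSERT INTO transactions (customer_id, drink, price, sold_at) VALUES (?, ?, ?, ?)",
    (42, "latte", 4.50, "2021-12-16 08:15:00"),
)

# READ: a simple query, the kind of small analysis a database administrator might run.
cur.execute("SELECT drink, COUNT(*) FROM transactions GROUP BY drink")
print(cur.fetchall())

# UPDATE: correct a mis-keyed price.
cur.execute("UPDATE transactions SET price = 4.75 WHERE transaction_id = 1")

# DELETE: purge rows deemed no longer necessary.
cur.execute("DELETE FROM transactions WHERE sold_at < '2020-01-01'")

conn.commit()
conn.close()
```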

One limitation of databases is that they do not offer us the proper ability to analyze historical data. This data is typically managed for only a limited time, as a constant flow of incoming transaction information can put a great deal of strain on a database. To help address this limitation, companies like our coffee shop can collect all their historical data in a new environment capable of handling the volume: the data warehouse.

The Data Warehouse

Data warehousing is a methodology for storing and delivering information to support decision making. There are two predominant approaches to building data warehouses: Inmon and Kimball. They differ in how requirements are determined, how the data is physically modeled, and how the warehouse is implemented. This environment becomes the online analytical processing (OLAP) system.

Determining Requirements

Inmon — An enterprise-wide focus is used when gathering requirements for this approach. It starts by identifying the key subject areas and business operations that the business cares about. The goal is to gather data pertinent to the longevity of the enterprise, at a top-level strategic view.

Kimball — Tactical and ad hoc team queries are better served by this method. Requirements still take the enterprise into consideration; however, data marts are created based on subject and priority. The goal is to quickly address key business questions through reporting.

Physical Model

Inmon — Logical models are built for each subject area and normalized to third normal form to reduce data redundancy. This lowers the complexity of loading data and allows the model to serve as a true single source of data for the enterprise. However, querying power is lost, because more joins and tables are required to query the data warehouse. To address this problem, data marts are built for departments. A data mart subsets information for a single subject and is used by a small group (i.e., a department) as a smaller version of the central data warehouse. These data marts are also de-normalized to allow drilling down into the information. In a typical Inmon diagram, data flows from the source systems into the central, normalized warehouse and then out into the departmental data marts.
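
To make the join trade-off concrete, here is a hedged sketch of a few third-normal-form tables for the coffee shop; even a simple question already requires walking across several of them. The table and column names are illustrative assumptions.

```python
import sqlite3

# A 3NF layout splits data across narrow tables to avoid redundancy.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE locations (location_id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE stores    (store_id INTEGER PRIMARY KEY,
                            location_id INTEGER REFERENCES locations(location_id));
    CREATE TABLE sales     (sale_id INTEGER PRIMARY KEY,
                            store_id INTEGER REFERENCES stores(store_id),
                            amount REAL, sold_on TEXT);
""")

# Even a simple question (revenue by city) already needs two joins.
cur.execute("""
    SELECT l.city, SUM(s.amount)
    FROM sales s
    JOIN stores st   ON s.store_id = st.store_id
    JOIN locations l ON st.location_id = l.location_id
    GROUP BY l.city
""")
print(cur.fetchall())
```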

Kimball — The model starts by analyzing key sources of information and bringing that data into a staging area (i.e., extraction). Within this stage, data is transformed, and conformed dimensions are created so that various facts can be drilled down into. Dimensions are shared among facts, and each subject is typically organized using a star schema. Multiple star schemas are combined to create the data warehouse, in which various facts are analyzed using the conformed dimensions. This model can be thought of as a collection of data marts, each analyzing a specific business requirement (i.e., a fact table) using all or some of the conformed dimensions. Pictured with just one data mart, the approach looks simple, but remember that many other data marts would eventually funnel into one big data repository.
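
Here is a minimal sketch of one such star schema for a sales fact, again with assumed table and column names. Each dimension would be conformed, i.e., shared with other fact tables as the warehouse grows.

```python
import sqlite3

conn = sqlite3.connect("warehouse.db")
cur = conn.cursor()

# One fact table surrounded by denormalized dimensions: a star schema.
# Table and column names are assumptions for the coffee shop example.
cur.executescript("""
    CREATE TABLE IF NOT EXISTS dim_date    (date_id INTEGER PRIMARY KEY, full_date TEXT,
                                            day_of_week TEXT, month TEXT, year INTEGER);
    CREATE TABLE IF NOT EXISTS dim_product (product_id INTEGER PRIMARY KEY, name TEXT,
                                            category TEXT, size TEXT);
    CREATE TABLE IF NOT EXISTS dim_store   (store_id INTEGER PRIMARY KEY, city TEXT, region TEXT);
    CREATE TABLE IF NOT EXISTS dim_guest   (guest_id INTEGER PRIMARY KEY, loyalty_tier TEXT);

    CREATE TABLE IF NOT EXISTS fact_sales (
        date_id    INTEGER REFERENCES dim_date(date_id),
        product_id INTEGER REFERENCES dim_product(product_id),
        store_id   INTEGER REFERENCES dim_store(store_id),
        guest_id   INTEGER REFERENCES dim_guest(guest_id),
        quantity   INTEGER,
        amount     REAL
    );
""")
conn.commit()
conn.close()
```

Because dim_date, dim_product, dim_store, and dim_guest are shared, a second fact table (say, one for inventory) could reuse them, which is what lets the separate data marts eventually behave like a single warehouse.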

Logistics

Inmon — Building this model is a significant time investment and could take 4–9 months or more to implement. It also requires a large team of specialists to develop the data warehouse environment and maintain it. This method is flexible and has the benefit of working with volatile source systems, given that data marts can be continuously created to meet different business requirements.

Kimball — This approach is implemented at a faster pace than Inmon, often in 2–3 months. With conformed dimensions and no need for a central data warehouse, a small team can handle implementation and maintenance. This method is better at reporting on pertinent business requirements but works best with stable source systems.

Data warehousing is a great structured path for getting proper analytical processing out of traditional online transaction processing systems, but there are limits to what it can accomplish. The primary issue is implementation, which Kimball attempted to combat with the bottom-up approach. There is still a need for highly skilled individuals who can move information from a database into a data warehouse. This is typically called the ETL process: extraction, transformation, and loading. Companies need to find talent capable of performing these procedures and tailoring the results to whatever business requirements need to be met. Maintaining the environment once it has been set up is a relatively simpler task, although it still requires further investment in talent.
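
As a rough, hedged sketch of what a small ETL job could look like between the illustrative OLTP and warehouse databases from the earlier snippets (it assumes the transactions table from the CRUD example exists), consider the following:

```python
import sqlite3

def run_etl(oltp_path: str, warehouse_path: str) -> None:
    """Extract raw transactions, apply light transforms, load a warehouse staging table."""
    oltp = sqlite3.connect(oltp_path)
    warehouse = sqlite3.connect(warehouse_path)

    # Extract: pull transactions from the OLTP system.
    rows = oltp.execute(
        "SELECT customer_id, drink, price, sold_at FROM transactions"
    ).fetchall()

    # Transform: drop bad rows, standardize drink names, split out the sale date.
    cleaned = [
        (sold_at[:10], drink.strip().lower(), customer_id, price)
        for customer_id, drink, price, sold_at in rows
        if price is not None and price > 0
    ]

    # Load: append into a staging table; a real job would then resolve
    # surrogate keys against the dimension tables before filling fact_sales.
    warehouse.execute(
        "CREATE TABLE IF NOT EXISTS stg_sales (sale_date TEXT, drink TEXT,"
        " customer_id INTEGER, amount REAL)"
    )
    warehouse.executemany("INSERT INTO stg_sales VALUES (?, ?, ?, ?)", cleaned)
    warehouse.commit()
    oltp.close()
    warehouse.close()

run_etl("coffee_shop.db", "warehouse.db")
```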

Individuals with these skills can also run into several problems given the newly added factors of big data and cloud computing. (We address big data further in the next section.) The emergence of cloud computing has allowed companies to lessen their hardware capital and focus their resources on other parts of their information technology environment. This shift requires even further training and understanding from skilled individuals, adding another barrier and intricacy to data warehousing.

Alternatives to The Data Warehouse

Up to this point we have traced the traditional path of the data life cycle, but there are always alternatives. The Kimball methodology itself was at one point considered an alternative and has now come to rival Inmon. The biggest reason companies diverge from this path is the growing need to handle big data.

Data is classified as big data based on the 5 Vs: velocity, volume, value, variety, and veracity. Given these 5 Vs, traditional data management cycles aren't always equipped to extract information, and even when they are, they don't use the full potential of the data. Some companies prefer to ditch the data warehouse entirely and instead create a repository that continuously collects and aggregates data with little to no structure. This repository is a data lake. (The data lake holds all raw data, typically in the form of copies from source systems.) It is crucial to understand that although the data lake is a repository of source-system copies, it still needs some management. When it isn't managed or given some structure, it can turn into a data swamp, characterized by its inability to offer any meaningful data because of its unmanaged nature. Anything past this can be considered a data pit, in which there is no usefulness at all.
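
A data lake is less a product than a convention: land raw copies of source data as-is, but organize them just enough (for example, by source system and ingestion date) that they stay findable. Here is a minimal local-filesystem sketch; the paths and fields are assumptions, and a real lake would typically sit on cloud object storage.

```python
import json
from datetime import date
from pathlib import Path

def land_raw_records(source: str, records: list[dict], lake_root: str = "data_lake") -> Path:
    """Write raw records untouched, partitioned by source system and ingestion date.

    This light folder discipline is roughly what keeps a lake from drifting into a swamp.
    """
    partition = Path(lake_root) / source / f"ingest_date={date.today().isoformat()}"
    partition.mkdir(parents=True, exist_ok=True)
    out_file = partition / "part-0001.json"
    with out_file.open("w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    return out_file

# Example: land a raw dump of social media comments, schema unknown in advance.
land_raw_records("social_comments", [
    {"post_id": 1, "user": "anon", "text": "best latte in town!!"},
    {"post_id": 1, "user": "anon2", "text": "meh"},
])
```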

Implications

Going back to our coffee shop scenario, the likelihood of it needing to expand to an online analytical processing system seems very low. However, it is possible that this small shop begins to scale up, continuously creating franchises and becoming a conglomerate. Its expansion would soon be limited by the already highly standardized and homogeneous array of competing offers. It would have to create a core competency within the company and a distinguishing factor in the minds of consumers.

In this instance, the core competency can be the management of its data. If this local coffee shop is able to take advantage of its source systems and begin to analyze the trends within the data it already collects, it would be set up to conduct quantitative analysis on its consumers and cater products toward them. For example, a small data mart can be created to analyze sales (i.e., our subject). We can then analyze sales by various dimensions such as location, time, guest, and product. This information would help shape how the marketing department goes about promoting products or entering new markets.
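
Against the illustrative star schema sketched earlier, that analysis is a short query. Assuming those tables exist and are populated in warehouse.db, a roll-up of sales by location and month might look like this:

```python
import sqlite3

conn = sqlite3.connect("warehouse.db")

# Roll up the sales fact by two of its dimensions: store location and month.
# Table and column names follow the illustrative star schema sketched earlier.
monthly_sales_by_city = conn.execute("""
    SELECT st.city, d.year, d.month, SUM(f.amount) AS revenue
    FROM fact_sales f
    JOIN dim_store st ON f.store_id = st.store_id
    JOIN dim_date  d  ON f.date_id  = d.date_id
    GROUP BY st.city, d.year, d.month
    ORDER BY revenue DESC
""").fetchall()

for city, year, month, revenue in monthly_sales_by_city:
    print(f"{city} {year}-{month}: ${revenue:,.2f}")

conn.close()
```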

This core competency of a highly data-driven and analytical culture would, in turn, also help distinguish the growing coffee chain from national brands and other local brands as it caters offerings to a specific market.

Conclusion

The data life cycle has many nuances that can take even the best data architect time to develop a keen sense for, and the constantly evolving pace of technology and the rise of big data make it even harder to implement traditional methods of management. Understanding this data life cycle, however, acts as a foundation for further exploring the field of data management.
