From Batches to Streams: How to Navigate the Data Product Life Cycle: A Comprehensive Guide (Part 2)

Mohamed Awnallah
30 min read · May 16, 2023


TL;DR

This article delves into the Data Product Life Cycle, outlining the stages from conception to retirement and emphasizing the significance of factors such as company size, industry, and maturity level. It covers processing method selection and the role of technology costs, speed, and maintenance, with case studies featuring Uber at different stages. The article concludes with a discussion of the limitations to be addressed in the fourth generation of Uber’s Big Data platform. Additionally, the article includes introductions to:

  • Product Management 101
  • Economics 101

Table of Contents (TOC)

  • I. Introduction
  • II. Product Management 101
  • III. Economics 101
  • IV. Data Product Life Cycle
  • V. Importance of considering company size, industry, and maturity level
  • VI. The role of technology costs, speed, and maintenance in choosing a processing method
  • VII. Early-Stage Companies (A Case Study of Uber [Before 2014])
  • VIII. Mid-Stage Companies (A Case Study of Uber [2015–2016])
  • IX. Enterprise Companies (A Case Study of Uber [2017 — present])

I. Introduction

In today’s digital age, data has become a valuable company asset to gain insights, improve decision-making, and create new products and services. However, to fully leverage the potential of data, companies need to go through a well-defined process known as the Data Product Life Cycle.

The Data Product Life Cycle involves a series of steps starting from data collection to the retirement of a product. The life cycle phases may vary depending on the company size, industry, and maturity level. Therefore, it is essential to understand the process and its nuances to make informed decisions on data processing methods and technologies.

In this article, we will delve into the Data Product Life Cycle, its phases, and the importance of considering company size, industry, and maturity level. We will also discuss the role of technology costs, speed, and maintenance in choosing a processing method. Finally, we will explore case studies of early-stage, mid-stage, and enterprise companies and the Data Product Life Cycle stages they went through.

So, whether you are a data analyst, data engineer, machine learning engineer, or simply interested in the field of data, this article will provide you with valuable insights into the Data Product Life Cycle and how it can be applied in different company scenarios.


II. Product Management 101

Qualitative metrics play a crucial role in the Product Life Cycle by providing valuable insights into the non-numerical aspects of product performance. Understanding and effectively utilizing qualitative metrics is essential for achieving a comprehensive understanding of a product’s impact and aligning it with customer needs.

5 Key Mistakes Companies Make Leading to Business Failure

New products and companies have a high failure rate. About 75% of new businesses fail, and new products fail to achieve significant market adoption at rates between 40% and 90%. This problem exists even on a smaller scale, where most ideas fail to provide value for customers or businesses. Even successful companies like Microsoft, Amazon, and Netflix see product failure rates between 50% and 70%.

1. Putting the Cart Before the Horse

Putting the cart before the horse is an idiomatic expression that refers to doing things in the wrong order, i.e., trying to accomplish a goal without first taking the necessary preliminary steps. The most common mistake companies make is scaling before having objective evidence that the product is something people want and that there is a reasonable market for it.

In the context of business, this could mean trying to sell a product before understanding the target market’s needs or developing a marketing strategy. It can also involve investing heavily in infrastructure or technology before determining if there is a viable market for the product or service.

This approach can lead to wasted resources, lost time, and ultimately, a failed business. To avoid this mistake, businesses should focus on understanding their target market and their needs before investing in product development or infrastructure. This approach can help ensure that businesses are meeting the needs of their customers and are not overspending.

Putting the Cart before the horse

2. Build — Build — Build

Just because you can build something does not mean you should. This approach involves continuously adding new features or products to a business without considering customer feedback or the market’s needs.

The problem with this approach is that it can lead to a lack of focus and direction, as well as overspending on development costs. It can also result in a product that is over-engineered and does not meet the target audience's needs.

To avoid this mistake, businesses should focus on developing a Minimum Viable Product (MVP) that meets the essential needs of their target audience. They should then use customer feedback and market research to iterate and improve the product before expanding their offerings. This approach can help ensure that businesses are meeting the needs of their customers and are not overspending on unnecessary development costs.


3. Cognitive Biases

Cognitive biases are a common mistake that can lead to business failure. They are mental shortcuts or errors in judgment that can lead to flawed decision-making. Roughly 150 cognitive biases have been cataloged, and they can cause us to make irrational decisions through automatisms in our brains (behavior that occurs unconsciously, without the person being aware that the action is taking place).

In the context of business, cognitive biases can lead to a range of negative outcomes, including overconfidence, confirmation bias, and sunk cost fallacy. For example, overconfidence bias can cause business leaders to overestimate their abilities or the success of a new product or service, leading to overspending and financial losses.

Innovator bias is the tendency to overestimate the potential benefits of new ideas or technologies due to their novelty, leading to poor decision-making. To mitigate innovator bias, it’s important to carefully evaluate new ideas or technologies based on objective criteria and seek diverse perspectives. Testing new ideas on a small scale can also help to avoid investing resources in unsustainable or ineffective innovations.

Innovator bias

Confirmation bias refers to the tendency to favor information that confirms existing beliefs while ignoring contradictory information. Like innovator bias, it can lead to poor decision-making and can be mitigated by seeking diverse perspectives and objectively evaluating new information.


The sunk cost fallacy refers to the tendency to continue investing resources in a project or decision simply because of the resources that have already been invested, even if the project is no longer viable or profitable. This can lead to poor decision-making, as individuals may keep investing time, money, or effort in a failing project rather than cutting their losses and moving on. To avoid the sunk cost fallacy, it’s important to evaluate decisions based on their potential future benefits and costs, rather than past investments that can’t be recovered.


4. Measuring the Wrong Things

Measuring the wrong things, such as vanity metrics, can lead to business failure by providing inaccurate or misleading information about the performance of the business. Vanity metrics are metrics that may make a business look good on the surface but do not provide meaningful insights into its actual success or profitability, such as social media likes or website traffic. If you have data that you can’t act on, it’s a vanity metric.

To avoid measuring the wrong things and falling into the trap of vanity metrics, businesses should focus on measuring metrics that directly impact their goals and bottom line, such as customer acquisition cost, customer lifetime value, and revenue growth. Measuring the right metrics provides more accurate insights into the performance of the business and enables data-driven decisions that lead to long-term success.


5. Wishful Thinking

Wishful thinking is a common mistake that can lead to business failure. It refers to the tendency to believe in a positive outcome without considering the evidence or potential risks, making decisions without data, or operating blindly.

Wishful thinking can also cause business leaders to ignore warning signs or feedback from customers or employees, leading to flawed decision-making and missed opportunities for improvement.

To avoid wishful thinking, business leaders should remain objective and data-driven in their decision-making. They should seek out diverse perspectives and feedback and be willing to reassess their assumptions and strategies as new information becomes available. By avoiding wishful thinking, businesses can make more informed decisions and improve their chances of success.


The Uncertainty — Investment Curve

The uncertainty investment curve is a model that describes the relationship between the level of uncertainty and the amount of investment that should be made. It suggests that investment should be minimal when uncertainty is high and gradually increased as uncertainty decreases, until reaching an optimal level of investment where the benefits outweigh the costs. Beyond that point, additional investment can become wasteful and risky.

The Uncertainty — Investment Curve

Ash Maurya’s Model

There are many models that explain the life cycle of a product, but Ash Maurya’s model will be referenced here. It is also known as the Lean Canvas, a popular tool used by entrepreneurs and businesses to develop and validate their business ideas. The Lean Canvas is a one-page business plan that helps entrepreneurs identify and address the key components of their business model.

Lean Canvas Model

In Ash Maurya’s model, the Lean Canvas identifies three main phases in the product lifecycle:

  • Problem-solution fit: In this phase, early-stage companies focus on understanding the customer’s problem and developing a solution that meets their needs. This involves testing and refining the solution to ensure that it effectively addresses the customer’s problem. A build-measure-learn cycle is typically used in this phase.
  • Product-market fit: Once the solution is validated and problem-solution fit is achieved, the focus shifts to identifying the target market and developing a go-to-market strategy. This involves identifying the right channels to reach the target market and refining the pricing, messaging, and positioning of the product.
  • Scale: In the final phase, the focus is on scaling the business and increasing market share. This involves optimizing the business model, refining the product, and expanding the customer base through marketing and sales initiatives. This scaling typically corresponds to the Customer Creation and Company Building stages.
Ash Maurya’s model of three main phases in the product lifecycle

Decision-Making Frameworks

Kill, pivot, and preserve is a decision-making framework used by startups and entrepreneurs to evaluate the viability of a product or business idea.

Kill means abandoning the idea altogether because it does not have enough potential for success. This decision is made after a thorough evaluation of the market, customer needs, competition, and other factors that affect the success of the idea.

Pivot means making significant changes to the product or business model to better align with customer needs and market demand. This decision is made when the company believes that the core idea has potential, but it needs to be adapted to better fit the market.

Preserve means to continue with the current idea or product because it has demonstrated potential for success and has a strong market fit. This decision is made when the company believes that the idea is solid and has a strong potential for growth and profitability.

Data-informed and Human-driven decision making

Humans have the inspiration; Machines do the validation

You should be informed by data but not guided by it.

III. Economics 101

In the dynamic landscape of data-driven decision-making, incorporating quantitative metrics is crucial for evaluating the success and impact of data products. After covering qualitative metrics in Product Management 101, this section delves into two important quantitative metrics, Return on Investment (ROI) and Return on Invested Capital (ROIC), within the context of the Data Product Life Cycle.

Return on Investment (ROI)

It’s a way to figure out if you made a good investment or not. Imagine you have a lemonade stand. You bought all the ingredients and supplies for $20, and you sold the lemonade for $40, which means you made $20 in profit.

Now, to find out how much you made compared to how much you spent, we can use a special formula called Return on Investment (ROI). To use this formula, we divide the net income (the $20 profit you made) by the cost of the investment (the $20 you spent) and multiply it by 100.

So, we have:

ROI = Net income / Cost of investment x 100

ROI = $20 / $20 x 100

ROI = 100%

This means that your ROI is 100%. That’s a big number! It means that for every $1 you invested in your lemonade stand, you made $1 in profit.

In other words, ROI helps us understand how much money we make compared to how much money we spend. And the higher the ROI, the better it is for our business!
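
To make the arithmetic concrete, here is a minimal Python sketch of the ROI formula above, using the lemonade-stand numbers:

```python
def roi(net_income: float, cost_of_investment: float) -> float:
    """ROI as a percentage: (net income / cost of investment) * 100."""
    return net_income / cost_of_investment * 100

# Lemonade stand: $20 profit on a $20 investment.
print(roi(net_income=20, cost_of_investment=20))  # 100.0, i.e., a 100% ROI
```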


Return on Invested Capital (ROIC)

It’s a metric that tells us how much profit a company earns relative to the amount of money it has invested in its operations. It’s like asking, “If I give you a certain amount of money to start a business, how much money will you be able to make with it?”

Let’s say your parents start a small business that makes and sells handmade furniture. They invest $100,000 of their own money and borrow an additional $50,000 from a bank to buy equipment and raw materials and to pay for other expenses.

If their business earns $20,000 in profit for the year, the ROIC would be calculated like this:

ROIC = Earnings Before Interest and Taxes (EBIT) / (Total Debt + Total Equity)

EBIT = $20,000, Total Debt = $50,000, Total Equity = $100,000

ROIC = $20,000 / ($50,000 + $100,000)

ROIC = $20,000 / $150,000

ROIC = 0.133 or 13.3%

This means that for every dollar that your parents invested in the business, they made 13.3 cents in profit. A higher ROIC indicates that the business is generating more profits from its invested capital, while a lower ROIC may suggest that the business is less efficient in using its capital to generate profits.
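
The same calculation as a minimal Python sketch, using the furniture-business numbers from above:

```python
def roic(ebit: float, total_debt: float, total_equity: float) -> float:
    """ROIC: EBIT divided by total invested capital (debt + equity)."""
    return ebit / (total_debt + total_equity)

# Furniture business: $20,000 EBIT on $50,000 debt and $100,000 equity.
print(f"{roic(20_000, 50_000, 100_000):.1%}")  # 13.3%
```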


ROI vs ROIC

The main difference between ROIC and ROI is the way they define “capital” in their formulas.

Return on Invested Capital (ROIC) measures the percentage return a company earns on all the capital it has invested in its operations, including both debt and equity. It takes into account the total amount of invested capital that a company uses to generate its earnings, whereas ROI only considers the amount of investment made in a particular project or initiative.

Return on Investment (ROI), on the other hand, measures the profitability of a specific investment or project, typically over a shorter time frame. It is calculated by dividing the net profit of the investment by the amount of the initial investment. ROI is useful for evaluating the performance of individual projects or investments, while ROIC is more useful for assessing the overall efficiency of a company’s use of capital.

ROIC is a more comprehensive measure of a company’s profitability because it considers all of the capital invested in the business, not just one particular investment. In contrast, ROI is a narrower measure that only applies to specific projects or investments.

IV. The Data Product Life Cycle

The Data Product Life Cycle is a framework that outlines the stages a data product goes through during its lifetime. It combines the Software Development Life Cycle (or Machine Learning Life Cycle) with the Product Life Cycle, which describes the stages a product goes through from inception to retirement. The Data Product Life Cycle is made up of several stages: inception, development, introduction, growth, maturity, saturation, decline, and retirement. Here are the stages in detail:

Data Product Life Cycle

1. Inception

The inception stage is the first stage of the data product life cycle. During this stage, companies identify the need for a data product and start to plan for its development. This stage involves brainstorming, idea generation, and identifying potential use cases for the data product. At this point, investors may provide seed funding to support the initial planning and feasibility assessment of a data product idea. This investment may help cover costs associated with market research, prototyping, or proof-of-concept development.


2. Development

The development stage is the longest and most complex stage of the data product life cycle. It involves several sub-stages, including data collection, data processing, data analysis, data visualization, data action, machine learning model development, deployment, monitoring, and maintenance. Each of these sub-stages is critical to the success of the data product and must be executed effectively to ensure that the data product is accurate, reliable, and useful. At this point, investors may provide funding to support the design, development, and testing of a data product. This investment may help cover costs associated with hiring data scientists, acquiring data, purchasing hardware and software, or outsourcing development work.

Data Development Life Cycle

2.1) Data Collection

The data collection stage involves collecting data from various sources, including internal and external sources. The data collected may be structured or unstructured and may come from a variety of sources, including databases, spreadsheets, social media platforms, and sensor networks.
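
As a minimal illustration, the pandas sketch below pulls together a structured internal export and a semi-structured external dump; the file names and the user_id join key are hypothetical placeholders, not a prescribed setup:

```python
import pandas as pd

# Hypothetical sources: a CSV export from an internal database and a
# records-oriented JSON dump from an external API.
orders = pd.read_csv("orders.csv")         # structured, internal
events = pd.read_json("app_events.json")   # semi-structured, external

# Combine the two sources on a shared key before downstream processing.
combined = orders.merge(events, on="user_id", how="left")
print(combined.head())
```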

2.2) Data Processing

The data processing stage involves transforming raw data into a more structured format that can be used for analysis. This stage may involve cleaning the data, removing outliers, and transforming the data into a format that can be used for analysis.
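
A minimal pandas sketch of this stage, assuming a single numeric column and using the common 1.5 × IQR rule as one possible outlier filter:

```python
import pandas as pd

def clean(df: pd.DataFrame, value_col: str) -> pd.DataFrame:
    """Drop duplicates and missing values, then filter outliers
    outside 1.5 * IQR (one simple, widely used rule)."""
    df = df.drop_duplicates().dropna(subset=[value_col])
    q1, q3 = df[value_col].quantile([0.25, 0.75])
    iqr = q3 - q1
    in_range = df[value_col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return df[in_range]

# Example: the duplicate row and the missing value are dropped,
# and 1000.0 is filtered out as an outlier.
raw = pd.DataFrame({"fare": [12.0, 12.0, 14.5, 13.0, None, 1000.0]})
print(clean(raw, "fare"))
```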


2.3) Data Analysis

The data analysis stage involves analyzing the data to identify patterns, trends, and insights. This stage may involve using statistical models, machine learning algorithms, or other analytical tools to extract insights from the data.
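
For illustration, a small pandas sketch that surfaces a simple pattern, the average fare and trip count per city, from made-up trip data:

```python
import pandas as pd

# Hypothetical trip data: one row per trip.
trips = pd.DataFrame({
    "city": ["NYC", "NYC", "SF", "SF", "SF"],
    "fare": [12.0, 18.5, 22.0, 19.0, 25.5],
})

# A basic pattern-finding step: aggregate per city.
summary = trips.groupby("city")["fare"].agg(["mean", "count"])
print(summary)
```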


2.4) Data Visualization

The data visualization stage involves creating visualizations, such as charts, graphs, and dashboards, to communicate the insights derived from the data. This stage is critical to ensuring that the insights are understandable and actionable.
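
A minimal matplotlib sketch of this stage, charting a made-up daily metric; in practice this would usually be a dashboard rather than a static image:

```python
import matplotlib.pyplot as plt

days = ["Mon", "Tue", "Wed", "Thu", "Fri"]
signups = [120, 135, 128, 160, 190]

plt.plot(days, signups, marker="o")
plt.title("Daily signups")     # clear labels make the insight self-explanatory
plt.xlabel("Day")
plt.ylabel("Signups")
plt.tight_layout()
plt.savefig("daily_signups.png")  # or plt.show() in an interactive session
```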

2.5) Data Action

The data action stage involves using the insights derived from the data to make informed decisions and take action. This stage may involve developing new products, improving existing products, or optimizing business processes.

2.6) Machine Learning Model Development

The machine learning model development stage involves developing machine learning models to automate data processing and analysis. This stage may involve using supervised or unsupervised learning techniques to develop models that can be used to predict future outcomes.
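
A minimal supervised-learning sketch with scikit-learn; the synthetic dataset is a stand-in for labeled historical records:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for labeled historical data.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train on the past, evaluate on a held-out slice to estimate future accuracy.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("holdout accuracy:", accuracy_score(y_test, model.predict(X_test)))
```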

Photo by Abid Ali Awan on Datacamp

2.7) Deployment

The deployment stage involves deploying the data product to a production environment where it can be used by end users. This stage may involve integrating the data product with existing systems and ensuring that it is accessible to end users.


2.8) Monitoring

The monitoring stage involves monitoring the performance of the data product to ensure that it is functioning as expected. This stage may involve setting up alerts and notifications to alert stakeholders when issues arise.
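
As a sketch of the idea, a hypothetical data-freshness check; in a real deployment the alert would page an on-call channel instead of printing:

```python
def check_freshness(minutes_since_last_load: float,
                    threshold_minutes: float = 60.0) -> bool:
    """Return True and emit an alert if the data is staler than the threshold."""
    if minutes_since_last_load > threshold_minutes:
        print(f"ALERT: data is {minutes_since_last_load:.0f} min stale "
              f"(threshold {threshold_minutes:.0f} min)")
        return True
    return False

check_freshness(95)  # -> ALERT: data is 95 min stale (threshold 60 min)
```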


2.9) Maintenance

The maintenance stage involves maintaining the data product to ensure that it continues to function as expected. This stage may involve updating the data product to address issues or adding new features to improve its functionality.


3. Introduction (Launch)

The introduction stage is the stage where the data product is first introduced to the market. During this stage, companies may focus on building awareness and generating interest in the data product. At this point, investors may provide funding to support the release of a data product to the intended users. This investment may help cover costs associated with marketing, advertising, documentation, training, or support.

4. Growth

The growth stage is the stage where the data product experiences rapid growth in demand. During this stage, companies may focus on expanding the data product’s reach and scaling its operations. At this point, investors may provide funding to support the growth of a data product’s user base. This investment may help cover costs associated with scaling the product, improving user experience, or adding new features.

5. Maturity

The maturity stage is the stage where the data product reaches its peak level of demand. During this stage, companies may focus on optimizing the data product’s performance. At this point, investors may provide funding to support the maintenance and continued availability of a data product. This investment may help cover costs associated with upgrading the product, providing technical support, or ensuring compliance with regulatory requirements.

6. Saturation

The saturation stage is the stage where the data product’s growth starts to slow down and demand starts to plateau. During this stage, companies may focus on retaining their existing customers and developing new use cases for the data product. At this point, investors may fund additional resources to support customer retention and to explore new use cases, helping the company maintain its market position and continue to generate revenue from the data product.

7. Decline

The decline stage is the stage where demand for the data product starts to decline. This decline may be due to a variety of factors, including competition, changes in the market, or changes in consumer preferences. During this stage, companies may focus on managing the decline and minimizing losses. At this point, investors may decide to divest from a data product that is no longer generating a return on investment. This may involve selling off the product or its assets, or liquidating their stake in the company that owns the product.

8. Retirement

The retirement stage is the stage where the data product is no longer viable and is removed from the market. This may be due to a variety of factors, including obsolescence, lack of demand, or changes in the company’s strategy. During this stage, companies may focus on decommissioning the data product and transitioning their customers to alternative solutions.

It is important to note that the stages may not always occur in a linear sequence, and the process may be iterative, where companies may revisit and refine previous stages as they progress through the cycle. Additionally, it is crucial to highlight that successful data product development requires significant investment in terms of time, money, and resources. Therefore, careful planning and execution of each stage are essential for a data product’s success in the market.

V. Importance of considering company size, industry, and maturity level

The data product life cycle is not a one-size-fits-all framework. Different companies, industries, and maturity levels may have unique requirements and constraints that must be taken into account when developing and managing data products. For example, early-stage companies may have limited resources and may need to focus on developing a minimum viable product (MVP) quickly to test their hypotheses. On the other hand, mid-stage and enterprise companies may have more resources and may be able to invest in more complex data products that offer more advanced functionality.

Similarly, different industries may have unique data requirements and constraints. For example, healthcare companies may need to comply with strict data privacy regulations, while financial services companies may need to ensure that their data products are secure and comply with regulatory requirements.

Finally, the maturity level of the company may also impact the data product life cycle. Early-stage companies may have a more flexible approach to data product development and may be able to pivot quickly based on customer feedback. On the other hand, mid-stage and enterprise companies may have more established processes and procedures in place for data product development and may require more rigorous testing and validation before deploying data products to production.

VI. The role of technology costs, speed, and maintenance in choosing a processing method

The choice of processing method used during the development of a data product can have a significant impact on its success. Different processing methods may have different costs, speeds, and maintenance requirements, which can impact the overall cost and effectiveness of the data product.

For example, batch processing may be a cost-effective solution for companies that have large volumes of data that can be processed offline. However, batch processing may be slower than real-time processing, which may impact the timeliness and accuracy of the data product’s insights. On the other hand, real-time processing may be faster but may require more resources and maintenance to ensure that it is functioning as expected.

Similarly, the choice of technology used during data product development can also impact its cost, speed, and maintenance requirements. For example, cloud-based solutions may be more cost-effective than on-premise solutions but may require a reliable internet connection to ensure that the data product is accessible. On the other hand, on-premise solutions may be more expensive but may offer greater control over the data and processing methods.

In conclusion, when developing a data product, it is essential to consider the data product life cycle, company size, industry, and maturity level, as well as the costs, speed, and maintenance requirements of different processing methods and technologies. By taking these factors into account, companies can develop data products that are accurate, reliable, and useful and can meet the unique requirements and constraints of their specific situation.

VII. Early-Stage Companies

A. Definition of Early-Stage Companies

Early-stage companies, also known as startups, are newly formed and often in the initial stages of developing and bringing a product or service to market. These companies typically operate with limited resources and are focused on rapid growth and scaling. They may have a small team of employees and a lean organizational structure. Early-stage companies are primarily focused on identifying and solving a problem in the market, achieving problem-solution fit, and creating a product or service that meets customer needs. They are high-risk ventures, with uncertain market demand and financial stability. These companies may rely on funding from investors to support their growth and development.

B. Characteristics of Early-Stage Companies

Early-stage companies often exhibit certain characteristics that differentiate them from more established businesses. These characteristics may include:

  • Limited resources: Early-stage companies may have limited financial resources, manpower, and infrastructure, which can make it difficult to achieve growth and scalability.
  • Focus on innovation: Early-stage companies tend to be focused on developing new and innovative products or services that address a specific need or problem in the market.
  • Flexibility: Due to their small size and limited structure, early-stage companies are often more flexible and adaptable to changes in the market or business environment.
  • High risk: Early-stage companies are often high-risk ventures, with uncertain market demand, financial stability, and long-term viability.
  • Emphasis on growth: Early-stage companies are typically focused on rapid growth and scaling, to become a successful and established business in the future.

C. Data Product Life Cycle at Early-Stage Companies: A Case Study of Uber (Before 2014)

Before 2014, Uber had a limited amount of data that could fit into a few traditional online transaction processing (OLTP) databases. Engineers had to access each database or table individually, and users were left to write their own code if they needed to combine data from different databases.

Photo on Uber Blog

In 2014, as the amount of incoming data increased, Uber decided to build the first generation of its analytical data warehouse to aggregate all of its data in one place and streamline data access. To achieve this, Uber categorized its data users into three main categories: city operations teams, data scientists and analysts, and engineering teams.

For this, Uber used Vertica as its data warehouse software because of its fast, scalable, and column-oriented design. Uber also developed multiple ad hoc ETL jobs that copied data from different sources into Vertica.
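
To give a feel for the shape of such an ad hoc job, here is a minimal Python sketch; sqlite3 stands in for both the real OLTP source and the Vertica target, and the table layout is an assumption for illustration:

```python
import sqlite3

# Stand-in OLTP source with a tiny seeded table (placeholder schema).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE trips (trip_id TEXT, city TEXT, fare REAL)")
source.executemany("INSERT INTO trips VALUES (?, ?, ?)",
                   [("t1", "SF", 21.5), ("t2", "NYC", 13.0)])

# Stand-in warehouse target (Vertica, in Uber's case).
warehouse = sqlite3.connect("warehouse.db")
warehouse.execute(
    "CREATE TABLE IF NOT EXISTS trips (trip_id TEXT, city TEXT, fare REAL)"
)

# Extract from the source, load into the warehouse copy.
rows = source.execute("SELECT trip_id, city, fare FROM trips").fetchall()
warehouse.executemany("INSERT INTO trips VALUES (?, ?, ?)", rows)
warehouse.commit()
```

Note that the fragility described below comes from exactly this pattern: nothing stops the producer from changing the source schema and silently breaking the job.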

Photo on Uber Blog

D. Choosing the Appropriate Processing Method

For company size, Uber’s growth in the number of cities and countries served and the number of riders and drivers using the service increased the amount of incoming data, so a scalable solution was required that could store and process large amounts of data. For industry, the transportation business generates a large amount of data, creating the need for a data warehouse to store and process it. For maturity level, Uber was at the beginning of its big data journey, so the primary goal was to unblock the critical business need for centralized data access.

Regarding costs, the first-generation data warehouse was expensive to scale as the company grew, so Uber started deleting older, obsolete data to free up space for new data. The cost of the solution was therefore a factor in determining the processing method. For speed, the ad hoc ETL jobs that copied data from different sources into Vertica made data access fast, often sub-minute, so the processing method had to provide fast data access.

Finally, in terms of maintenance, the lack of a formal contract between the services producing the data and the downstream data consumers made ETL jobs fragile, and the use of flexible JSON format resulted in the lack of schema enforcement for the source data. Thus, data reliability became a concern, and the processing method should provide data reliability and ease of maintenance.

Considering all these factors, the processing method chosen was a column-oriented data warehouse software named Vertica. The use of SQL as a simple standard interface enabled city operators to easily interact with the data without knowing about the underlying technologies. However, limitations such as data reliability, scaling costs, and lack of schema communication mechanisms were identified as the company grew.

E. The Limitations

The widespread use of Uber’s data warehouse and incoming data revealed a few limitations, such as data reliability becoming a concern and scaling the data warehouse becoming increasingly expensive.

Additionally, ETL jobs that ingested data into the data warehouse were also very fragile due to the lack of a formal schema communication mechanism, most of the source data being in JSON format, and ingestion jobs not being resilient to changes in the producer code.

To address these limitations, Uber was working on its Generation 2 Big Data platform, which was re-architected around the Hadoop ecosystem.

VIII. Mid-Stage Companies

A. Definition of Mid-Stage Companies

Mid-stage companies are businesses that have already established a solid foundation and have begun to experience growth in terms of revenue, customers, and market share. Generally, mid-stage companies have moved beyond the initial startup phase and have successfully demonstrated product-market fit.

B. Characteristics of Mid-Stage Companies

Some key characteristics of mid-stage companies include:

  1. Established customer base: Mid-stage companies have a stable and growing customer base, which helps to generate steady revenue streams.
  2. Proven business model: Mid-stage companies have a solid business model that has been validated through successful sales and customer retention.
  3. Scaling operations: Mid-stage companies are in the process of scaling up their operations to meet the increased demand for their products or services. They may be expanding their workforce, production capacity, or distribution channels.
  4. Increased competition: As mid-stage companies become more successful, they attract more attention from competitors and face greater competition in their industry.
  5. Focus on profitability: Unlike early-stage companies, mid-stage companies are typically focused on achieving profitability rather than just growth. They are looking to generate sustainable revenue streams and increase their profit margins.
  6. Experienced management team: Mid-stage companies typically have an experienced management team in place, with a track record of success in their industry.
  7. Access to funding: Mid-stage companies have usually already gone through one or more rounds of funding and have a proven track record of success, making them attractive to investors who are looking for more established companies with a higher likelihood of success.

C. Data Product Life Cycle at Mid-Stage Companies: A Case Study of Uber (2015–2016)

In 2015–2016, Uber’s rapid growth exposed challenges in its first-generation Big Data platform in terms of scalability, accessibility, and flexibility. To address these challenges, Uber re-architected the platform around the Hadoop ecosystem. This involved introducing a Hadoop data lake, which significantly lowered the pressure on online data stores and allowed the transition from ad hoc ingestion jobs to a scalable ingestion platform. Uber also introduced Presto, Apache Spark, and Apache Hive to enable interactive ad hoc user queries, facilitate programmatic access to raw data, and serve as the workhorse for extremely large queries, respectively.

Photo on Uber Blog

To keep the platform scalable, Uber ensured all data modeling and transformation only happened in Hadoop, enabling fast backfilling and recovery when issues arose. Only the most critical modeled tables were transferred to their data warehouse. In addition, Uber made all data services in this ecosystem horizontally scalable, thereby improving the efficiency and stability of their Big Data platform. They also schematized all data, transitioning from JSON to Parquet to store schema and data together. To accomplish this, they built a central schema service to collect, store, and serve schemas as well as different client libraries to integrate different services with this central schema service.
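
A minimal PySpark sketch of the schema-enforcement idea: reading JSON against an explicit schema and writing Parquet, which stores the schema alongside the data. The paths and fields are placeholders, not Uber’s actual pipeline code:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

spark = SparkSession.builder.appName("json-to-parquet").getOrCreate()

# An explicit schema replaces schemaless JSON: malformed records surface
# immediately instead of silently changing shape downstream.
schema = StructType([
    StructField("trip_id", StringType(), nullable=False),
    StructField("city", StringType(), nullable=True),
    StructField("fare", DoubleType(), nullable=True),
])

trips = spark.read.schema(schema).json("raw/trips/")  # placeholder input path
trips.write.mode("overwrite").parquet("lake/trips/")  # schema travels with data
```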

With Uber’s business continuing to scale at light speed, they soon had tens of petabytes of data. Daily, there were tens of terabytes of new data added to their data lake, and their Big Data platform grew to over 10,000 cores with over 100,000 running batch jobs on any given day. This resulted in their Hadoop data lake becoming the centralized source of truth for all analytical Uber data.

Photo on Uber Blog

D. Choosing the Appropriate Processing Method

In terms of company size, larger companies with vast amounts of data may require more scalable processing methods such as Hadoop, while smaller companies may be able to work with simpler processing methods. In terms of industry, some industries such as healthcare may have strict regulations on data processing methods and may require more secure processing methods.

In terms of maturity level, companies with more mature data teams may be able to handle more complex processing methods such as Hadoop, while companies with less mature data teams may need to start with simpler processing methods.

Cost is also a significant factor when choosing a processing method: Hadoop can be more expensive to set up and maintain, while simpler processing methods may be more cost-effective. Speed matters as well, since some processing methods are faster than others, depending on the company’s needs. Finally, maintenance is a consideration, as some processing methods require more ongoing upkeep than others.

E. The Limitations

The re-architecting of Uber’s Big Data platform around the Hadoop ecosystem allowed them to address several limitations, including scalability, accessibility, and flexibility. However, as their business continued to scale and they had tens of petabytes of data stored in their ecosystem, they faced a new set of challenges.

The massive number of small files stored in HDFS began adding extra pressure on the HDFS NameNodes, and data latency was still far from what the business needed. New data was only accessible to users once every 24 hours, which was too slow for real-time decisions. And while moving ETL and modeling into Hadoop made the process more scalable, these steps remained bottlenecks because the ETL jobs had to recreate entire modeled tables on every run.

IX. Enterprise Companies

A. Definition of Enterprise Companies

Enterprise companies are large organizations that have a significant level of complexity, hierarchy, and multiple departments. These companies typically have a broad range of products and services and operate in multiple geographic locations. They are often characterized by a large number of employees, high levels of revenue, and a substantial market share in their respective industries. Scale is a crucial aspect of enterprise companies, as they must be able to manage and process large amounts of data and transactions to operate efficiently.

B. Characteristics of Enterprise Companies

Enterprise companies have a unique set of characteristics that distinguish them from other types of companies. These include:

  1. Complex organizational structure: Enterprise companies typically have a complex hierarchical structure with multiple departments and functional areas, making decision-making more complex.
  2. A broad range of products and services: Enterprise companies offer a wide range of products and services across various industries, often leveraging their size and scale to provide a competitive advantage.
  3. Large employee base: Enterprise companies employ a large number of people across multiple geographic locations, which requires them to have efficient HR and talent management processes.
  4. High levels of revenue: Enterprise companies generate significant revenue and operate on a global scale, often making them subject to various regulations and compliance requirements.
  5. Extensive data management: Enterprise companies deal with massive amounts of data from multiple sources and require efficient data management systems to ensure that data is properly processed, stored, and analyzed.
  6. Robust IT infrastructure: Enterprise companies require a robust IT infrastructure to support their operations, including large-scale databases, networks, and cloud-based systems.
  7. Strong brand recognition: Enterprise companies typically have strong brand recognition and market position, often making them leaders in their respective industries.

C. Data Product Life Cycle at Enterprise Companies: A Case Study of Uber (2017 — present)

From 2017 to the present, Uber has been rebuilding its Big Data platform for the long term by addressing scalability limitations and data latency. The third generation of the platform holds over 100 petabytes of data in HDFS and serves 100,000 Presto queries, 10,000 Spark jobs, and 20,000 Hive queries per day. The company’s research revealed four pain points: HDFS scalability limitations, the need for faster data in Hadoop, support for updates and deletes in Hadoop and Parquet, and faster ETL and modeling. To address them, Uber built Hadoop Upserts anD Incremental (Hudi), an open-source Spark library that provides an abstraction layer on top of HDFS and Parquet to support the required update and delete operations. Hudi allows users to incrementally pull out only changed data, significantly improving query efficiency and allowing for incremental updates of derived modeled tables.
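
To give a feel for the two operations Hudi adds on top of HDFS and Parquet, here is a hedged PySpark sketch. It assumes a Spark session launched with the Hudi bundle on the classpath; the table name, key fields, commit time, and paths are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-sketch").getOrCreate()
updates = spark.read.parquet("staging/trip_updates/")  # placeholder input

# Upsert: rows whose trip_id already exists are updated in place
# instead of forcing a rebuild of the whole table.
(updates.write.format("hudi")
    .option("hoodie.table.name", "trips")
    .option("hoodie.datasource.write.recordkey.field", "trip_id")
    .option("hoodie.datasource.write.precombine.field", "updated_at")
    .option("hoodie.datasource.write.operation", "upsert")
    .mode("append")
    .save("lake/trips_hudi/"))

# Incremental pull: fetch only records changed after a given commit time,
# so downstream ETL can update derived tables without full rescans.
changed = (spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20230101000000")
    .load("lake/trips_hudi/"))
```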

Photo on Uber Blog

D. Choosing the Appropriate Processing Method

When choosing the appropriate processing method for the Uber case study data platform architecture, several factors need to be considered.

  1. Company size: Uber is a large company with over 22,000 employees, so the chosen processing method needs to be scalable to handle large amounts of data.
  2. Industry: Uber operates in the transportation industry, which generates a significant amount of data from various sources, including GPS data, user data, and transaction data.
  3. Maturity level: Uber’s Big Data Platform is in its third generation, meaning it is a mature platform that requires a processing method that can handle complex and diverse data processing needs.
  4. Costs: The chosen processing method should be cost-effective and align with Uber’s business goals.
  5. Speed: The processing method needs to be fast and able to handle real-time data processing needs, as Uber operates in a highly competitive industry where speed is crucial.
  6. Maintenance: The chosen processing method should be easy to maintain and support, with a reliable community and documentation.

In response to the pain points identified by Uber, the company chose to build Hadoop Upserts and Incremental (Hudi), an open-source Spark library that provides an abstraction layer on top of HDFS and Parquet. Hudi addresses several pain points, including HDFS scalability limitations, support of updates and deletes in Hadoop and Parquet, and faster ETL and modeling. Additionally, Hudi allows for incremental updates of derived modeled tables, which significantly improves query efficiency.

Overall, Hudi was a suitable processing method for the Uber case study data platform architecture. It addressed the identified pain points and was scalable, cost-effective, fast, and easy to maintain with reliable documentation and community support.

E. The Limitations to be Fixed in Generation 4

While Uber’s third-generation Big Data platform has improved data accessibility and processing efficiency, there are still some limitations that need to be addressed. Here are some of the ongoing efforts to enhance Uber’s Big Data platform for improved data quality, data latency, efficiency, scalability, and reliability, which will be fixed in generation 4:

Data Quality

  • Non-schema-conforming data is a significant issue because some upstream data stores do not enforce or check the data schema before storage.
  • Schemas alone do not guarantee quality: they ensure that data contains the correct data types, but they do not check the actual data values. To close both gaps, Uber is expanding its schema service to support semantic checks.

Data Latency

Uber aims to reduce raw data latency in Hadoop to five minutes and data latency for modeled tables to ten minutes. This will allow more use cases to move away from stream processing to more efficient mini-batch processing that uses Hudi’s incremental data pulls.

Data Efficiency

  • To improve data efficiency, Uber is moving away from relying on dedicated hardware for any of its services and towards service dockerization. This approach will enable better resource management and allocation.
  • Uber is unifying all of its resource schedulers within and across its Hadoop ecosystem to bridge the gap between its Hadoop and non-data services across the company.

Scalability and Reliability

Uber’s ingestion platform was developed as a generic, pluggable model, but the actual ingestion of upstream data still involves many source-dependent pipeline configurations, making the ingestion pipelines fragile and increasing the maintenance cost of operating several thousand of them. Uber is working towards improving the scalability and reliability of its platform by identifying and fixing issues related to edge cases.

Conclusion

In conclusion, understanding the Data Product Life Cycle is crucial for developing successful data-driven products. By considering factors such as processing method selection, industry, and company size, businesses can create sustainable products that add value to their customers.

Data Product Life Cycle

We value your feedback and would love to hear your thoughts on this article. What did you find most helpful or insightful? What could we have done better? Let us know in the comments below.

This is the second part of a three-part series; we look forward to sharing the next part with you and hearing your thoughts along the way. Thank you for reading and for your time!

Read Also

From Batches to Streams: Streamlining Data Engineering for Comprehensive Internet Oversight (Part 3)

From Batches to Streams: Different Ways for Ingesting Data (Part 1)

Credits

- Written by Mohamed Awnallah

- Reviewed by Riku Driscoll, Zacharias Voulgaris, Stanley Ndagi
