How a Data Organization Evolves

Sequoia
Sequoia Capital Publication
Apr 4, 2019 · 7 min read

To build a strong, data-informed company, one needs both a world-class team that focuses on impact and a culture that truly understands the value of data. In Why Data Science Matters, we discussed the importance of data science and why companies are increasingly focusing on using data to help build products. In this article, we will explore how products evolve over time and how data infrastructure, teams and organizations evolve alongside them.

As more products are built, more internet-connected devices are purchased, and more time is spent online, the volume of user-interaction data increases dramatically. Simultaneously, the virtuous cycle of more A/B testing and experimentation leading to faster product iteration — which in turn leads to accelerated development releases, which then compounds product growth — has fueled companies’ internal demand for insights from that data.

A business’s ability to compete is increasingly driven by how successfully it applies analytics to vast, unstructured data sets and how the insights from those analyses drive innovation. As a result, it has never been more important to build an organization that generates business value from data.

A necessary ingredient to take advantage of data is a data organization that is responsible for all outcomes enabled by data. This includes three primary objectives:

  1. Evaluate the health of the business. Monitor key product metrics; understand the drivers of changes in those metrics and identify outliers; build and analyze dashboards, reports and visualizations.
  2. Ship the right product. Design and evaluate experiments; segment users and build models of their behaviors; power production systems using artificial intelligence and machine learning.
  3. Set roadmap and strategy. Deeply explore and analyze the user journey; generate actionable insights and forecast phenomena.

DATA ORGANIZATION STRUCTURE

To achieve these outcomes, the right infrastructure is needed. We’re going to walk you through that infrastructure using the diagram below.

The first step in the process is to log all user interactions with a product — every click, hover, open, close, and login (plus any metadata) that takes place on apps and the web, as well as any data from third-party providers. Generally, the size of this data scales quickly as the number of users and their engagement increases.
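Concretely, each logged interaction is usually just a small structured record. The sketch below shows one plausible shape for such a record in Python; the field names are illustrative assumptions, not a standard schema.

```python
import json
import time
import uuid

def log_event(user_id: str, event: str, metadata: dict) -> str:
    """Serialize one user interaction as a structured log record.
    The field names are illustrative, not a standard schema."""
    record = {
        "event_id": str(uuid.uuid4()),  # unique id for deduplication downstream
        "timestamp": time.time(),       # when the interaction happened
        "user_id": user_id,
        "event": event,                 # e.g., "click", "hover", "login"
        "metadata": metadata,           # device, app version, page, etc.
    }
    return json.dumps(record)

print(log_event("u_123", "click", {"page": "home", "device": "ios"}))
```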

Even though very little of this data is utilized in a meaningful way, logging is a critical step in the process. Companies generally don’t know what data they’ll ultimately end up needing, so it is simplest to log everything. Certain types of logged data then need to be streamed in real time to be useful, for example in fraud detection and live video.
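The article does not prescribe tooling, but as a rough sketch, real-time consumption might look like the following, assuming Apache Kafka via the kafka-python client; the topic name and the fraud heuristic are invented for illustration.

```python
import json
from kafka import KafkaConsumer  # pip install kafka-python

# Consume raw events as they arrive instead of waiting for a batch job.
consumer = KafkaConsumer(
    "user-events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    event = message.value
    meta = event.get("metadata", {})
    # Toy fraud heuristic (invented): flag logins from an unexpected country.
    if event.get("event") == "login" and meta.get("country") != meta.get("account_country"):
        print("flag for review:", event.get("user_id"))
```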

This raw log data, along with data from third-party providers and other transactional systems, is then extracted, transformed, aggregated, and loaded (via a process known as extract, transform, load, or ETL) into a data warehouse, which stores the data in a more structured (usually SQL-backed) form. Some larger companies opt to preserve all incoming data in its raw form in a data lake (a centralized repository that stores all data), from which they can rehydrate downstream data stores with updated logic.
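A minimal ETL sketch might look like the following, with Python’s built-in sqlite3 standing in for a real warehouse; the file name and schema carry over from the toy logging record above, and a production pipeline would add scheduling, retries, and a columnar store.

```python
import json
import sqlite3

# Extract-transform-load in miniature: sqlite3 stands in for a real
# warehouse, and the records match the toy logging schema above.
conn = sqlite3.connect("warehouse.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS events "
    "(event_id TEXT PRIMARY KEY, ts REAL, user_id TEXT, event TEXT)"
)

with open("raw_events.jsonl") as f:      # Extract: one JSON record per line
    for line in f:
        r = json.loads(line)             # Transform: parse, keep needed fields
        conn.execute(
            "INSERT OR IGNORE INTO events VALUES (?, ?, ?, ?)",  # Load
            (r["event_id"], r["timestamp"], r["user_id"], r["event"]),
        )
conn.commit()
```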

Many mid-sized and large companies have multiple data warehouses and data lakes, making direct analysis of the data intractable without integrating data sets. As a result, these data sets undergo another ETL process targeted for a specific use case (e.g., advertiser growth data). The outputs are then stored in an analytics database used for conducting deeper analyses, constructing reports and visualizations, and building artificial intelligence and machine-learning (AI/ML) models. The insights from these analyses help drive roadmap and strategy, while visualizations and reports help with monitoring the progress of a product and AI/ML models assist in automation and prediction.
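Continuing the toy schema, this second ETL pass might roll raw events up into a use-case-specific analytics table, for example daily activity per user:

```python
import sqlite3

# Second, use-case-specific ETL pass: aggregate raw events into an
# analytics table of daily activity per user (toy schema from above).
conn = sqlite3.connect("warehouse.db")
conn.executescript("""
    CREATE TABLE IF NOT EXISTS daily_user_activity AS
    SELECT user_id,
           DATE(ts, 'unixepoch') AS day,
           COUNT(*)              AS events
    FROM events
    GROUP BY user_id, day;
""")
conn.close()
```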

Additionally, in the test-and-learn approach to product development (which is the key to building any data-informed product), products are built and customized based on the user behaviors that are tracked. Large sets of product experiments (e.g., A/B tests) are run, evaluated, and implemented based on their impact on key metrics. In these experiments, feature flags segment users and ensure that different user groups receive different treatments.
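One common (though not the only) way to implement that segmentation is deterministic hashing, so a user keeps the same variant across sessions and assignments stay independent across experiments; this sketch is illustrative rather than a description of any particular feature-flagging system.

```python
import hashlib

def assign_variant(user_id: str, experiment: str,
                   variants=("control", "treatment")) -> str:
    """Deterministically bucket a user into an experiment arm.
    Hashing (experiment, user) keeps a user's assignment stable across
    sessions and independent across experiments; the scheme is illustrative."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

print(assign_variant("u_123", "new_onboarding"))  # same user, same arm, every time
```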

As the data stack has standardized, multiple data-related professions have emerged, including data analysts, data engineers, data infrastructure engineers, data architects, and data scientists. The creators, end users, and data products vary across different segments of the stack (see chart below).

EVOLUTION OF DATA ORGANIZATIONS

The function of a data organization should evolve with the growth of the product. For example, hiring data engineers who specialize in petabyte-scale data is probably not valuable at an early stage, but may become so as the product gets more use. While resourcing for data teams should be guided by near- to medium-term needs, infrastructure should be built for long-term needs.

Activities such as counting the number of users across various products and features might provide immense value early on, but over time, the data team’s scope should encompass far more. Below, we outline how responsibilities are defined based on the primary tasks for specific stages of development.

Most organizations rely on their data teams to count numbers or provide dashboards. Only a few are consistently running experiments to improve products, and fewer still are leveraging data-informed analyses to guide their goals and roadmaps.

At the outset, when organizations are counting numbers, the core skill required is excellence in technical execution. Setting up the infrastructure to reliably generate KPIs, creating data stores to track these numbers over time, and building basic reporting all require strong technical competency. For most companies, the product team is the de facto first iteration of the data team: it defines metrics and calculates and stores data around those metrics as product usage ramps up.
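As an illustration of this stage, a core KPI like daily active users reduces to a single aggregate query; this sketch reuses the toy sqlite3 events table from the ETL example earlier.

```python
import sqlite3

# The "counting numbers" stage in miniature: a daily-active-users KPI
# computed from the toy events table used in the ETL sketch earlier.
conn = sqlite3.connect("warehouse.db")
rows = conn.execute("""
    SELECT DATE(ts, 'unixepoch') AS day,
           COUNT(DISTINCT user_id) AS daily_active_users
    FROM events
    GROUP BY day
    ORDER BY day
""").fetchall()

for day, dau in rows:
    print(day, dau)
```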

As the company and product evolve, dashboarding and reporting become extremely important. This is when data engineering becomes a core function independent of product engineering, and infrastructure is created specifically to power ETL and reporting functions. It is at this point that a deeper product mindset becomes an important addition to the data team’s skill set. In addition to monitoring KPIs and providing reporting to the rest of the organization, performing ad hoc analyses to identify the root causes of metric deviations becomes a core responsibility of the data team.

Once the product has achieved enough scale that experimentation is both possible (statistically significant results can be reached) and critical to improving the product experience, statistical skills become important for both data analysts and engineering teams. For data analysts, it is crucial that experiments are well designed and that results are interpreted in a statistically sound way. On the backend, an experimentation framework will need to handle concerns like user tracking (so that the same user is not part of multiple related experiments) and other statistical features that enable quick interpretation of results. More on this in a future post.
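For intuition about the statistics involved, here is one textbook way to evaluate a conversion-rate experiment, a two-proportion z-test; real experimentation frameworks may use different or more sophisticated methods, and the counts below are invented.

```python
import math

def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Two-sided z-test for a difference in conversion rates between
    control (a) and treatment (b). A textbook method, not necessarily
    what any particular experimentation framework implements."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return z, p_value

# Invented counts: 4.8% vs. 5.4% conversion on 10,000 users per arm.
z, p = two_proportion_z_test(conv_a=480, n_a=10_000, conv_b=540, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

With these invented numbers, a seemingly healthy lift (5.4% vs. 4.8%) lands just outside significance at the 0.05 level (p ≈ 0.054), which is exactly why well-designed, well-powered experiments and sound interpretation matter.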

Finally, the greatest leverage of the data science team comes from helping to set goals, roadmaps, and strategies. Setting the right goals requires a good understanding of the overall business objectives. Setting roadmaps requires the ability to perform exploratory analysis that identifies issues and opportunities and connects insights to outcomes: understanding the drivers of a phenomenon, the levers available to make changes, and the set of possible actions the insights point to. It is impossible to do this well without excellent domain knowledge and an analytical mindset that is drawn to outcomes.

Additionally, setting strategy for the product team requires a strong ability to understand all the related phenomena (“the dots”), to identify and understand how the dots are connected, and to recommend the strategy that makes the most sense. Finally, communicating effectively and with clarity to senior leadership is important to the eventual adoption of data-informed methods for setting goals, roadmaps, and strategies.

TAKEAWAYS

  • A thoughtful understanding of data infrastructure, along with hiring the right data talent at different stages of the product life cycle, contributes to a data organization’s success.
  • A data organization is responsible for three outcomes: monitoring business health; shipping the right products; and setting product goals, roadmaps and strategy.
  • A data organization’s role should evolve up the value stack over time, from counting numbers to eventually setting roadmaps and strategy for the product.

This work is a product of Sequoia Capital’s Data Science team. Chandra Narayanan, Hem Wadhar and Ahry Jeon wrote this post. See the full data science series here. Please email data-science@sequoiacap.com with questions, comments and other feedback.
