Most startups understand the importance of customer data, but few believe they have enough people working in data-related roles. I see this with startups I work with and also in studies that point to a data engineering shortage.
Though I can’t solve this shortage, I did gather some benchmark data on how many resources companies of different sizes have in these areas. I was inspired by a recent Keen.io post on the data teams and data stacks of larger companies and wanted to do a similar study on companies that do not yet have the resources of Facebook, Netflix, Airbnb, and Pinterest.
I organized this study around 3 questions relating to different parts of the data stack as outlined in the data engineering post by Stitch Data:
- Data Consolidation: How many FTEs do you have that work as data engineers or business intelligence engineers? If some people split this role with other duties, such as software development, then feel free to include decimals for the rough amount of their role spent in this area.
- Data Warehousing: Do you have a data warehouse for customer-related data? If so, what platform does it use? (examples: Amazon Redshift, Snowflake, Amazon RDS for PostgreSQL, Google BigQuery, etc.)
- Analytics: How many FTEs do you have working as analysts? Could be people who do product analytics, marketing analytics, operation analytics, etc. Again, if someone also has other duties (like finance or marketing) then you can include decimals.
The people in the first question are the ones who set up data infrastructure. This includes the choices made in the second question, as well as any tools or code that bring data into a data warehouse. It is common for data engineers to also maintain and enforce consistency in data. Maintaining a ‘universal source of truth’ for key business metrics is critical to a company functioning properly and is often the responsibility of data engineers.
The people in the third question are the ones who use a company’s data assets. They could be information workers analyzing the performance of a business or data scientists creating and testing a new recommendation algorithm. They depend on the work covered by the first two questions to do their jobs effectively.
Over 50 companies participated in this study. Answers to these questions were expected to vary by company size and sector, so I also collected data on who the company is primarily selling to (consumers or businesses) as well as the amount of funding they have raised. Funding is not a perfect gauge for company maturity as some have bootstrapped themselves to impressive heights, but more funding often allows for more hiring so this was the primary way I segmented companies.
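For readers who want to run a similar benchmark on their own responses, the segmentation above can be sketched in a few lines of Python. This is only a rough illustration: the sample rows are invented placeholders (not the survey data), and the field names, the `funding_bucket` helper, and the bucket edges are my assumptions rather than the exact segments used in the study.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical responses, invented purely for illustration (NOT the survey data).
# FTE values may be fractional when someone splits a role with other duties.
responses = [
    {"funding_musd": 4, "engineer_ftes": 0.5, "analyst_ftes": 1.0},
    {"funding_musd": 8, "engineer_ftes": 1.5, "analyst_ftes": 2.0},
    {"funding_musd": 15, "engineer_ftes": 1.0, "analyst_ftes": 1.5},
    {"funding_musd": 40, "engineer_ftes": 3.0, "analyst_ftes": 6.0},
]


def funding_bucket(funding_musd):
    """Map funding (in $M) to a segment label. Bucket edges are assumed."""
    if funding_musd < 10:
        return "< $10M"
    if funding_musd < 20:
        return "$10M-$20M"
    return ">= $20M"


def average_ftes_by_bucket(rows):
    """Average engineer and analyst FTEs per funding segment."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[funding_bucket(row["funding_musd"])].append(row)
    return {
        bucket: {
            "companies": len(members),
            "engineer_ftes": round(mean(r["engineer_ftes"] for r in members), 2),
            "analyst_ftes": round(mean(r["analyst_ftes"] for r in members), 2),
        }
        for bucket, members in grouped.items()
    }


summary = average_ftes_by_bucket(responses)
for bucket, stats in summary.items():
    print(bucket, stats)
```

Keeping FTEs as decimals means a developer who spends a quarter of their time on pipelines still counts toward the engineering total, which matters most in the smallest funding segment.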
Finding #1: You are Behind the Game If You Don’t Have a Data Warehouse
I have written before about the importance of investing early in data collection, so it was great to see that nearly 90% of companies in our survey have a data warehouse. Even more impressive was that so many of the companies with the lowest amount of funding have gone in this direction:
Setting up this foundation early allows talented people to find the second- and third-level insights that can help identify product-market fit, or reveal new opportunities to scale up in smart ways once that fit has been found.
Finding #2: Cloud Data Warehouses Dominate with Redshift the Current Leader
Nearly 60% of those with a data warehouse use Amazon’s Redshift. Also popular were Postgres and MySQL (both often hosted on AWS), and a few companies used Google’s BigQuery and Snowflake. Much less common were products like SQL Server and Oracle, which would likely have dominated a survey like this a decade ago.
Performance, cost, and functionality of these solutions are all much improved from the options companies faced years ago. This has been a key enabler for companies to be more data-driven and responsive to customers and has allowed earlier-stage companies to more easily compete and gain share.
Finding #3: Even Early-Stage Companies Have > 1 Person in Each Category
Respondents with less than $10M in funding had on average 1.20 data engineers and 1.58 data analysts:
Answers with decimals were common in the < $10M funding category. At this level, people wear many hats, including in the data world. That is understandable given resource constraints. However, I recommend assessing development team capabilities as a company grows to make sure technical debt is not accumulating before a full-time data engineer or data scientist can be hired. Data silos are a common result when companies are not forward-looking enough with their infrastructure, and they can be problematic.
Finding #4: Resources Don’t Increase Until $20M in Funding
The prior chart showed little growth in both categories up through $20M in funding. Once companies hit that stage, however, both categories begin to increase as both the scale and complexity of businesses go up. Questions at this stage start to arise around centralized vs. embedded analyst resources, which is beyond the scope of this project but something to study as each comes with pros and cons.
Finding #5: Companies Get More Leverage from Data Engineers than Analysts
It was interesting to observe that analytics resources grew more quickly with scale than data engineering resources did. My theory on this is that the most successful companies combine a solid data infrastructure with a culture of data self-service where people throughout the company are empowered to seek answers to business questions. As a result, the incremental resources in the analytics group were often small fractions of large numbers of people rather than dedicated resources. This arrangement was referenced by a rapidly growing marketplace company participating in our study:
“Our culture is really centered around data self-service, so a surprising number of people at (our company) actively write SQL and do their own data work.”
I experienced a similar environment at both zulily and Blue Nile. Without bottlenecks around data access, we could move quickly based on what we saw customers doing in many different functional groups. This contributed to a culture of critical thinking that is common with high-performing companies.
We continue to believe that data will be an important, if not the most important, currency for companies as they grow, and that the focus on this area in the early days of a company will only increase as new technologies come online. Already we see services that make it easier to implement these solutions, but that does not negate the need for professional data engineers.
In fact, while improvements in ETL tools and cloud-based data warehouses will allow companies to do more with less, there are also trends that will increase the need for more people working on data. Both consumers and businesses will expect their products to be smarter through artificial intelligence, and delivering on that effectively will require increasing investments in both categories of people. Additionally, data integration is likely to take on greater importance as companies choose to use best-in-class 3rd party tools that require data flowing through them to reach their potential.
We will continue to monitor how these technologies and the need for resources evolve over time. Please share any observations you are seeing about data with me at: email@example.com