Redesigning Data and Analytics at Managed by Q

Tim Finkel
Feb 1, 2019 · 7 min read

There are a lot of great resources out there about how to design the data and analytics function for an organization. You should do X, and then Y, and finally Z and you’ll be data-driven. Success is never linear, however, and there are fewer discussions about designs that have seen success but encountered problems.

In this post, I’ll discuss why we redesigned data and analytics at Managed by Q in 2018, how we did it, and where we’ll go from here.

The good old days

Historically, we believed that by embedding analysts from a central team within specific functional areas of the business (e.g., product, marketing, operations, growth), we could build analytic competence across the organization. First, analysts would work with individual teams to provide access: lightly modifying data pipelines to expose information in our business intelligence tool (Looker), helping build reports, and working with users to turn raw data into valuable insights. We believed that over time, analysts would progressively increase their impact, performing the activities they were uniquely suited to and surfacing ever more valuable insights to the organization.

Like any good strategy, ours involved tradeoffs: things we would do, and things we would not. We optimized for agility and immediacy of impact, liberally hacking together solutions while making the minimal investments in infrastructure necessary to get the job done.

We kept our interface with the product and engineering teams informal. We used third-party tools to copy data into our data warehouse, and it was easy enough to fix the broken transformations that popped up in the warehouse as our product evolved. The engineering team would usually give us a heads-up, and it was easy to keep an eye on what was going on.

Our strategy was working great. As our organization and product evolved over time, teams across the organization used data to make critical decisions quickly and accurately, and analysts moved from one functional area to another, progressively leveling-up our capabilities.

The crossroads

The appetite for information across the organization was insatiable and it was often difficult for our data team of three to keep up. Many teams needed data to do their jobs — to understand and manage our partner base, to learn about how users engaged with our product, and to manage performance to plan.

Thankfully, a handful of teams became proficient enough to self-serve on reporting and analysis with minimal help from the data team. In some cases an individual’s role expanded to include some analysis. In other cases, team members in other parts of the organization became de facto analysts.

Over time, our engineering team also tripled in size. As the frequency of changes to our underlying data models increased, the data team was being notified less often. This led to more breaks in the data warehouse, which took longer to address.

Significant, systemic problems began to materialize in early 2018. We had become more reactive than proactive and our mission as a team had become unclear.

Analysts were spending a majority of their time reacting to problems

  • Looker users were reporting inaccuracies and missing data in dimensions and metrics that should have been detected earlier in the data pipeline
  • Redshift would frequently become overwhelmed with queries, leading to outages that would last hours in the middle of the workday (see the triage sketch after this list)
  • Scheduled reports, dashboards, predictions, and other applications would fail silently, until an end-user reported a problem or someone stumbled across it
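
On the Redshift point above: a query against the STV_RECENTS system table is one way to see what is currently running and for how long, which is typically the first step in triaging this kind of pile-up. This is a generic sketch rather than anything specific to our setup.

```sql
-- Generic triage sketch: list currently running Redshift queries, longest first.
-- STV_RECENTS reports duration in microseconds.
select
    pid,
    user_name,
    duration / 1000000 as seconds_running,
    substring(query, 1, 80) as query_snippet
from stv_recents
where status = 'Running'
order by duration desc;
```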

The (un)reliability of our systems didn't just lead to firefighting; it also eroded trust in the dimensions, metrics, reports, and dashboards that analysts and end users created. That, in turn, led to a never-ending barrage of requests to verify reports, and questions about sources of truth and their veracity.

Taken together, it was clear we needed to make investments in our underlying systems to increase their reliability, rebuild trust, and get our team out of a vicious cycle of reactivity.

Our team’s mission became unclear

We were happy to see analytic capability spreading to various pockets of the organization. Other functions (e.g., operations) were owning more of their own reporting and analysis, taking some responsibilities off the plate of the central data team. This led to questions about what was expected from the central team.

Our purpose had become clouded, and the responsibilities that should be entrusted to us or the decentralized functions were unclear. For example:

  • Who owned the definition for a given metric? …and who was responsible for ensuring quality of the underlying data?
  • Who was responsible for maintaining team-specific dashboards or running reports?
  • If someone had a question about a dimension or a report filter, who should they reach out to?
  • How could we best leverage our data to understand and differentiate our business?

Redefining our purpose

In light of significant reliability issues and increasingly decentralized analytics, we needed to redefine the central data team's responsibilities for ensuring fast and accurate decision making. We decided on two purposes: rebuilding a trustworthy data platform, and generating more valuable insights.

Rebuilding trust

With a redefined purpose and a clear mandate, we got to work addressing the underlying issues in our data platform. In short, we needed to increase visibility into issues, detecting them earlier and mitigating them faster. By doing this, we could facilitate a decentralized analytics organization.

The transformation layer

Through early 2018, our transformation layer was made up exclusively of Looker derived tables (hundreds of them). There were no automated checks to validate our assumptions, the accuracy of our transformations, whether they were refreshing at all, or how performant they were. To build trust we needed to detect issues before users did, and our lack of visibility and controls at this stage was a major impediment to doing that.

We adopted the dbt framework to address the issues in our transformation layer and give us visibility into what was going on. It was a good fit for us because of its robust testing functionality, minimal infrastructure requirements, and SQL-based workflow, which made the transition from Looker derived tables straightforward.
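
To make the transition concrete, here is a minimal sketch of what a former Looker derived table might look like as a dbt model. The model and column names (stg_orders, stg_accounts, and so on) are hypothetical, not our actual schema.

```sql
-- models/order_facts.sql (hypothetical names throughout)
-- A transformation that used to live as a Looker derived table, now a dbt model
-- materialized as a table in the warehouse on each run.
{{ config(materialized='table') }}

select
    o.id as order_id,
    o.account_id,
    a.segment,
    o.completed_at,
    o.total_amount
from {{ ref('stg_orders') }} o        -- ref() lets dbt build the dependency graph
left join {{ ref('stg_accounts') }} a
    on o.account_id = a.id
where o.completed_at is not null
```

Declaring dbt's built-in tests (not_null, unique, relationships) alongside models like this is the kind of automated check we simply didn't have with derived tables alone.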

It took quite some time, but as we moved more and more of our critical transformations to dbt, we gained ever greater visibility into the validity of our assumptions, along with the accuracy and performance of our transformations.

Other infrastructure

The data team was also responsible for a smorgasbord of critical applications running on a handful of EC2 instances: moving bespoke data sources into the warehouse, updating predictive models, and pushing data to third-party tools. This infrastructure made it hard for us to detect problems; we'd hear about issues from end users when outputs were suspect or missing, burning trust and accelerating the cycle of reactivity.

Over several months, and in parallel with our investments in the transformation layer, we moved these applications to our common engineering infrastructure and improved instrumentation so we could detect and respond to issues faster, increasing reliability. Leveraging infrastructure that already existed was helpful because monitoring tools either came built in or were easy to bolt on, and features like continuous integration and automated deployment let us take advantage of our engineering team's learnings and investments.

Evolving culture

In conjunction with the technical investments outlined above, we made significant cultural changes within the team and in our relationship with other parts of the organization.

As we started detecting issues earlier, we needed to ensure we were responding to them quickly. We focused on continuous improvement, tracking the proportion of issues we detected ourselves through a combination of bug reports, requests, periodic analysis of trends, and postmortems for incidents impacting our users. Over time we expected to detect most issues before users did, and to catch increasingly granular issues with smaller impact.

Validating assumptions around our inputs improved our interface with the engineering team significantly. At first, our automated tests caught more granular issues, letting us proactively address what users previously would have reported. Often, to fix these issues, we’d need to talk to an engineer or two and figure out what happened, or how the new system worked. Over time a new pattern started to appear — we were consistently being notified before changes were made, giving us time to preemptively address these changes with no end user impact.
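
As an illustration of what validating assumptions around our inputs can look like in dbt, a data test is simply a query that should return zero rows; if it ever returns anything, the run fails and we hear about it before users do. The table and status values below are hypothetical.

```sql
-- tests/order_status_is_known.sql (hypothetical table and values)
-- Fails, and alerts us, if the application starts writing a status value
-- our transformations don't yet account for.
select distinct status
from {{ ref('stg_orders') }}
where status not in ('pending', 'scheduled', 'completed', 'cancelled')
```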

At the same time, we worked with functional teams to begin transitioning more reporting and analysis to them, helping them self-serve on more requests. We invested in education, systems to facilitate collaboration, and targeted improvements to specific data models that were impediments to self-service.

(On the way to) the promised land

Like most changes, the early stages felt more like a fire drill than a path to success. As we gained greater visibility, we uncovered problems we didn't know existed. As we added tests and monitors, some proved too brittle or noisy, leading to interruptions and false positives. We made mistakes as we adopted new technologies and updated our infrastructure. Nonetheless, we carried on.

Almost imperceptibly at first, our vision eventually came to life. The crux of our thesis, that a reliable data platform would free us up to tackle unrealized opportunities, was validated, and our team was again focused on generating unique and valuable insights. We are using natural language and behavioral data to develop a better understanding of the communications happening on our platform and improve the user experience. We've started building expertise around the pricing of our services based on quoting and transactions happening on our platform.

Across the organization, people have better access to the data they need than ever before. We're seeing wins from self-sufficient teams using data to generate and validate hypotheses, identify opportunities, optimize spend, and improve efficiency. We're also seeing substantial reductions in cycle time, maintenance, and support, along with significant increases in reliability.

There have been and will continue to be setbacks on the road to data and analytics success. In 2018, we took a step back to clarify the role of our team, made substantial changes, and set ourselves up to provide significant value into the future.
