The Agile Data Warehouse

Harlan Smith
Harlan’s Data and Analytics Blog
7 min readMar 17, 2016
Photo courtesy of https://unsplash.com/anthonydelanoix

Over the years, I’ve worked with many companies to apply agile software development to enterprise data warehouse (EDW) development projects. Data warehousing isn’t that different from other types of software engineering, so it’s no surprise that the basic principles of agile project management, as well as its various flavors like Scrum and Kanban, apply very well to database and ETL development projects.

But I do still continue to see a few common questions that come up on project after project, especially for teams who are new to agile data warehousing. Here are the four most common questions I get from agile data warehouse teams.

1. How do we handle solution architecture with an agile team?

Agile is all about breaking things into small chunks that can be completed in days or weeks. Solution architecture is all about making decisions and predictions about what a system will look like months or sometimes years into the future. Agile helps us deliver fast today, but solution architecture helps us avoid refactoring and technical debt later on.

On many traditional EDW development teams, one or more Solution Architects is put in charge of the overall design of the system. He or she is typically responsible for the high-level system architecture, as well as data models, coding standards, naming conventions, reusable frameworks, and patterns that should be followed by developers, as well as anti-patterns to be avoided. An architect might create architecture diagrams, logical data models, and other design artifacts that can help serve as blueprints for the development team.

Things are a little different in the agile world. In Scrum, for instance, we have roles for a Product Owner, a ScrumMaster, and Developers, but no Architect. So does this mean solution architecture itself is a relic of the past? Absolutely not.

In agile, solution architecture is not the responsibility of just one person; it’s the responsibility of the whole team. Developers in an agile team are responsible for communicating with one another not only what they are working on, but also how they are approaching the work. Critical thinking, design, feedback, consistency, peer review, and open dialog are not only encouraged, they’re expected. Senior engineers (as well as ScrumMasters and Product Owners, where appropriate) may sometimes play a bigger role in major design decisions, but generally speaking, everyone on the team needs to be committed to making a great product.

In agile, we tend to shy away from architects who don’t code. Instead, we encourage our developers to grow their solution architecture skills. This means getting ETL developers up to speed on EDW solution architecture, data modeling, coding practices, and how to deal with (or, preferably, avoid) technical debt and refactoring.

But there is definitely still a place for Solution Architects on agile teams. Larger teams may still find it useful to have a solution architect who oversees the many agile teams working on different parts of the same solution or ecosystem. In this capacity, the architect is responsible for driving a shared understanding of the system architecture, development techniques, and data model. Some teams may also find it useful to think of a solution architect as a technical product manager, responsible for negotiating with the product manager to prioritize technical foundation tasks that will help build a scalable system that won’t require costly rework or refactoring later on.

Solution architects on an agile teams will often:

  • Actively participate in “whiteboard” design discussions with product owners and developers to discuss implementation of product features
  • Perform early stage design reviews as well as hands-on code reviews with developers
  • Meet one-on-one with developers when they have questions about approach or architecture
  • Lead periodic training and documentation for solution architecture, data model, and development tools and techniques
  • Participate in backlog grooming, sprint planning, and sprint review

2. How are detailed design tasks like data modeling and source-to-target mapping handled? Who does them and when?

This is one I see come up a lot. In ETL development, we often start with a simple spreadsheet that maps how each field in a data warehouse is populated, including source fields and transformation logic. Developers will reference the document while coding their ETL, and will (hopefully) keep it up-to-date if they find any discrepancies or if they take a different approach during implementation.

So, is a source-to-target mapping document included in the user story as part of “Definition of Ready” before a developer pulls the story from the backlog? Or does the developer complete source-to-target mapping during the sprint as part of the ETL development story?

The answer is (wait for it…), it depends.

In large teams, a Product Owner may have one or more analysts working in a product ownership team to help gather and document detailed requirements like this. In these cases, the up-front data discovery and detailed, field-level mappings can be performed by one of these analysts and attached to the user story while it is still in the product backlog.

On smaller teams, a product owner may only be able to give a vague idea of where the data should come from, and a developer will need to analyze and query the source data to perform discovery and mapping on their own.

Both are absolutely valid approaches. What you don’t want, though, is ambiguity on who is doing what. If source-to-target mapping is part of Definition of Ready, make sure that this is agreed upon by the whole team and is consistently enforced. If a developer is doing mapping during the sprint, this also needs to be explicitly called out.

Just keep in mind, if mapping is done during the sprint, you’ll need to be ready for the occasional delay as data discovery can uncover unexpected data anomalies, or we may find out that the data as it is captured or stored doesn’t allow us to meet the requirement of the story.

3. How do we handle dependencies between user stories or tasks?

Data warehouses are chock-full of dependencies. We have ETL packages that are dependent on upstream source data, stored procedures that are dependent on tables, referential integrity constraints that are dependent on not only tables but also the data in them, and don’t get me started on views (especially views that reference other views).

We need a way to manage and track these dependencies so we can plan for them during backlog grooming and sprint planning, and so developers know what order to work on tasks during the sprint.

Many agile project management tools (like Jira and TFS, for instance) allow teams to explicitly list dependencies between stories, and even visualize a dependency map. This is something many teams don’t take the time to do, but believe me; it can be well worth the effort. If you know that you need to build your Customer dimension before you build your Sales Order fact, it will save you some refactoring work if you do them in the right order.

If you are using a low-tech approach like sticky note cards to track your user stories, this can be done by physically stacking the cards on top of one another, with the tasks that must be done first sitting on top.

Part of the difficulty here is that — going back to Definition of Ready — a user story in a sprint backlog is not supposed to have any blockers and should be ready for any developer to work on. To use our example above, one developer might pick up the task to develop the Customer dimension on the same day that another developer picks up the task to work on the Sales Order fact. What do we do in this case?

As with most things in Agile, it comes down to team members communicating. The developer working on the Sales Order fact can start by building everything he or she can that isn’t dependent on the Customer dimension. If he or she gets to a point where work is blocked, perhaps they can switch gears to give the other developer a hand to get the dimension completed.

Many agile EDW teams struggle with dependencies at first, but once developers are communicating with one another and keeping track of the dependencies during planning, this tends to become like second nature.

4. How do we do Data Warehouse DevOps?

OK, this one really is difficult. Unlike most software projects, databases are (by definition) stateful. The fact that we have to be mindful of all the data that’s already in the warehouse when we deploy solutions greatly impacts how we develop and deploy our code. It is relatively easy to diff your Java or .Net code with the production application build, it is much harder to diff your DDL code with a live database schema (this is why I’m so in love with declarative database development tools like SQL Server Data Tools), and harder still to diff the actual data in the EDW.

To be honest, data warehouse teams have a lot of maturing to do when it comes to DevOps. I have seen some great progress and there are many tried and true tools and techniques that have been used in object-oriented programming and web development for many years, which we have been slower to adopt. Again, not because we’re lazy, but because some of the concepts don’t apply as cleanly to multi-terabyte data warehouse environments where historical data and queries must be kept intact.

First of all, monitoring must be built in from the start. If you can’t monitor your system, you can’t manage it either.

Secondly, we need to shift our mindset from separate “Dev” and “Ops” teams and move toward a model where the Dev team is the Ops team.

Third, we need to start adopting the tools and techniques (test-driven development, build automation, continuous integration, etc.) that have been pioneered in OOP and web development. True, this is easier said than done, and it’s difficult to do justice to the topic in just a few short paragraphs. But investment in DevOps is one of the most important thing today’s agile EDW teams can focus on to improve quality and reliability of the systems we build.

--

--