How Data Scientists Can Work Better With Engineers
Learning the right skills, owning investigative work, and collaborating on abstractions can go a long way
“No update: the pipeline is broken, and we’re doing our best to fix it.”
It’s the first Tuesday of the month, and your Finance team is hounding you about those end-of-cycle revenue numbers. You’d help them, but you can’t focus: your dashboards are crumbling one by one as error messages fill your browser window. Your Slack channels fill with angry messages from Product Managers who need up-to-the-second data on a product the company deprioritized months ago. You try to escape, but to no avail: you have a bi-weekly sync with your fate…erm, manager.
This is a Data Analyst’s nightmare. Data Pipelines are the lifeblood of the team: ETL jobs populate the tables you need to communicate results to stakeholders and, well, do your job. But an unfortunate truth of data operations is that these pipelines break, a lot, and fixing them isn’t always simple. Traditionally, this has all been the job of Data Engineers — but this disconnect between stakeholder response (on the Analysts) and the actual fix (on Engineers) leads to some frustrating situations. Oh, and also: nobody wants to spend their days maintaining data pipelines.
In my experience as a Data Analyst / Scientist, my teams have tried out a few different paradigms for how to address these ownership and communication issues, and I want to explore a few areas where small fixes can make life easier for everyone without broad team reorganizations. Everything that follows comes with an obvious caveat: every team and company is different, so results may vary.
It’s not realistic to ask Data Analysts and Scientists to own their entire production pipelines
Specialization exists for a reason: it works. While modern data infrastructure is getting easier to build and maintain in theory, layers of abstraction can make things increasingly opaque. That’s why it’s important to define exactly what we mean when we say “own pipelines” or “be vertical.”
To start off, it’s not realistic to ask Data Analysts and Data Scientists to own or debug core parts of data infrastructure. For a popular example, take HDFS, the distributed file system that many data teams use to store files. Its native client API is Java, and setting up and maintaining a modern HDFS deployment requires knowledge of distributed systems, networking, and general cloud infrastructure. Is it realistic for every team member to have that? No, it’s not. The same logic applies to other parts of a mature data stack: Analysts and Scientists probably aren’t going to be able to build the infrastructure for running Airflow on Mesos, or debug it when it fails.
So what is realistic? As you move up the stack things become easier to digest, and Analysts and Scientists can begin to understand what pipelines exist, what they’re built on, where they can fail, and what tables they populate. Returning to our previous examples, the HDFS filesystem can be navigated and understood through its web UI, and Airflow makes it easy to understand ETL job dependencies and see what went wrong.
To make use of this information in a way that actually improves data workflows and makes life easier for Data Engineers, we need to define an important concept: the Data Skill to Investment Matrix.
Continuous Learning and Defining the “Data Skill to Investment Matrix”
One of the best ways that data teams can help their engineers is by pushing themselves to learn new skills. It’s hard to overstate how important this is: “continued learning” is getting more popular among all types of teams, but on data teams it’s critical that team members develop new skills over time that enable them to more effectively interact with a company’s data infrastructure.
A unique part of data teams is that everyone is “technical” to some degree: everyone on the team will be writing code on a daily basis (even if it’s only SQL). That opens some interesting doors — there’s a lower barrier to learning how to interact with new technologies and languages. Especially as it becomes more in vogue for everyone to know Python or (increasingly) R, team members can learn the information they need to build new data skills. The issue lies with deciding what to learn and how, and that’s what the Data Skill to Investment Matrix is all about.
Unlike the now-infamous Harvard Business Review Data Matrix, this one is actually useful and looks at very specific skills. “Data Warehousing” is not a skill — reading ETL Python as an Analyst is.
The sweet spot for deciding where to focus is a high data skill to investment ratio: with relatively little time investment, you can develop skills that are disproportionately helpful. Consider two opposite examples: learning how to work with HDFS, and learning how to read the Python code that runs some critical jobs. As discussed above, the first takes a large investment in distributed systems and infrastructure knowledge before it pays off; the second is the opposite.
Learning to understand core Python jobs has a high data skill to investment ratio because it:
- Requires a low-to-medium investment of time
- Will help across many pipelines in many situations
- Is a jumping off point to other important Python-related skills
Depending on what your data infrastructure looks like, ask your Data Engineers which skills would help them most if you picked them up, then apply this framework to decide where to start.
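If it helps to make the triage concrete, here is a toy sketch of ranking candidate skills by their ratio. Everything in it is an assumption for illustration: the `Skill` class, the 1-to-5 scales, and the scores themselves are placeholders I made up, not measurements. The real inputs should come from your own Data Engineers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    name: str
    usefulness: int  # 1 (niche) to 5 (helps across many pipelines); a subjective guess
    investment: int  # 1 (a few days) to 5 (months of study); also a subjective guess

    @property
    def ratio(self) -> float:
        # Higher means more payoff per unit of learning time.
        return self.usefulness / self.investment


def triage(skills):
    """Return skills ordered from highest to lowest skill-to-investment ratio."""
    return sorted(skills, key=lambda s: s.ratio, reverse=True)


# Placeholder scores for the two examples discussed above.
candidates = [
    Skill("Administer HDFS", usefulness=3, investment=5),
    Skill("Read the Python behind core ETL jobs", usefulness=4, investment=2),
]

for skill in triage(candidates):
    print(f"{skill.ratio:.2f}  {skill.name}")
```

The numbers matter less than the conversation they force: scoring each skill with an engineer in the room is a quick way to find out what they would actually like you to learn.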
A mistake to look out for (one that I repeatedly make) is going too far and encroaching on what Data Engineering wants to spend their time on. Your goal should be to work effectively with them, not to steal their job. If they find it helpful for you to work on basic tickets, great; if not, don’t. Additionally, you’ll want to think about how the skills you choose to learn fit with your long-term career plans and what’s important to you as an individual.
“The Ticket Minimum” and Bug Communication
A great place to put this work into practice is in how the team communicates when pipelines break. Data Analysts and Scientists should make it as easy and straightforward as possible for Data Engineering to fix broken pipelines or data issues. Teams can design information flows that place the initial burden of handling data issues on stakeholder-facing roles before escalating to Engineering.
Accordingly, what I like to call “The Ticket Minimum” is the minimum amount of work that stakeholder-facing roles like Data Analysts and Scientists should put in to investigate data issues. The general rule is that every JIRA ticket (or whatever software you use) should at minimum contain:
- A detailed explanation of what the data issue is: which column is missing data, what dates it’s missing, etc.
- The source of the problem: missing partitions in HDFS, an incorrect backfill, etc.
- Links to the appropriate ETL jobs
This might take a lot of investigative work, but it makes it much easier for Data Engineering to take action once you’ve done the “grunt work” of finding the actual issue. This is another strong argument for Analysts to pick up whatever language your infrastructure lives in (probably Python): if you can’t read the jobs, it’s going to be difficult to isolate which ones are causing problems.
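As a rough sketch of what that investigation can look like in practice, the snippet below finds which daily partitions are missing from a date range and then checks a draft ticket against the three items in the Ticket Minimum. The names `find_missing_dates` and `DataIssueTicket`, the `revenue` column, and the dates are all hypothetical; where your list of existing partitions comes from will depend on your own stack.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


def find_missing_dates(present, start, end):
    """Return every date in [start, end] that has no partition in `present`."""
    have = set(present)
    days = (end - start).days + 1
    candidates = (start + timedelta(days=i) for i in range(days))
    return [day for day in candidates if day not in have]


@dataclass
class DataIssueTicket:
    description: str = ""        # what is wrong: which column, which dates
    suspected_source: str = ""   # e.g. missing partitions, an incorrect backfill
    etl_job_links: list = field(default_factory=list)

    def missing_fields(self):
        """Names of the Ticket Minimum items that are still empty."""
        required = {
            "description": self.description.strip(),
            "suspected_source": self.suspected_source.strip(),
            "etl_job_links": self.etl_job_links,
        }
        return [name for name, value in required.items() if not value]


# Hypothetical partitions found for a table; March 3 and 4 never landed.
partitions = [date(2020, 3, 1), date(2020, 3, 2), date(2020, 3, 5)]
gaps = find_missing_dates(partitions, date(2020, 3, 1), date(2020, 3, 5))

ticket = DataIssueTicket(
    description="revenue column is empty for: "
    + ", ".join(day.isoformat() for day in gaps),
    suspected_source="missing date partitions upstream",
)
print(ticket.missing_fields())  # still needs links to the ETL jobs
```

A check like `missing_fields()` could run before a ticket is filed, so that nothing reaches Data Engineering without the basics filled in.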
Working Together to Create Abstractions
Data Analysts and Scientists can also work better with Data Engineering by finding places where a shared abstraction would simplify workflows and communication.
For an illustrative example, our team uses Airflow to schedule and run most of our ETL jobs. Airflow jobs are defined in Python, and Airflow has its own concepts that are foreign to most Data Analysts and Scientists, so creating new jobs and new tables is a multi-disciplinary effort. Our Data Engineering team created an abstraction, called dag-factory, that simplifies the process and makes it possible for everyone on the team to create jobs.
With dag-factory, you can schedule data-related DAGs (ETL jobs in the simplest sense) using YAML instead of Python.
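To give a feel for the idea, here is a minimal sketch of the general pattern: a declarative config describes tasks and their dependencies, and a small amount of Python validates it and works out a runnable order. This is not dag-factory’s actual schema or API. A plain dict stands in for the parsed YAML, and the job name, task names, commands, and the `plan_tasks` helper are all hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Stand-in for a parsed YAML file. Every name and key here is hypothetical.
config = {
    "daily_revenue": {
        "schedule": "0 3 * * *",
        "tasks": {
            "extract_orders": {"command": "python extract_orders.py"},
            "build_revenue_table": {
                "command": "python build_revenue.py",
                "dependencies": ["extract_orders"],
            },
            "refresh_dashboard": {
                "command": "python refresh_dashboard.py",
                "dependencies": ["build_revenue_table"],
            },
        },
    }
}


def plan_tasks(dag_config):
    """Validate one DAG's tasks and return them in a runnable order."""
    tasks = dag_config["tasks"]
    graph = {}
    for name, spec in tasks.items():
        deps = spec.get("dependencies", [])
        unknown = [dep for dep in deps if dep not in tasks]
        if unknown:
            raise ValueError(f"{name} depends on undefined tasks: {unknown}")
        graph[name] = set(deps)  # task -> the tasks that must finish first
    # static_order() raises graphlib.CycleError if the dependencies loop.
    return list(TopologicalSorter(graph).static_order())


for dag_name, dag_config in config.items():
    print(dag_name, dag_config["schedule"], plan_tasks(dag_config))
```

The appeal of this kind of design is that an Analyst only has to edit the config, while the Python that turns it into real scheduled jobs stays in one place that Data Engineering owns.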
This is a pretty involved abstraction, but yours doesn’t need to be this complex. If as an Analyst or Scientist you find yourself repeatedly writing the same joins or running into the same problem, document it and see if a unified solution would make sense.
The Basic Rule: Put Yourself In Their Shoes
We started with the basic rule of Data Engineering: nobody enjoys maintaining data pipelines, but somebody needs to do it. Teams can distribute this responsibility by pushing Data Analysts and Scientists to take on more engineering work where it’s realistic: through learning new skills, doing the heavy lifting on investigation, and thinking through where abstractions can be helpful.
I’m skeptical of the vision that some have of Analysts and Scientists completely owning their own pipelines and production infrastructure, but some small things can get us part of the way there.
(To continue the discussion, find @jgage718 on Twitter)