Own your own logging (and impact)

Published in Deliberate Data Science · 8 min read · Feb 20, 2019

Autonomy plays a large role in career satisfaction (ref). As a Data Scientist, I’ve started to wonder if I have autonomy in my career. I’m lucky enough to have talented coworkers, reasonable managers and some leeway to choose what I work on. Plus, Data Science is still in high demand … so that means Data Scientists have high autonomy, right? Nope. Not at all.

Data Science as a field is not set up for autonomy. At least not real autonomy. We may have superficial autonomy over scheduling and which problems we think we should work on. But for a large class of Data Scientists, our work ends with a data-supported recommendation, and whether that recommendation is acted upon depends on a multitude of factors outside our control.

I am pretty sure that means we’re Product Managers who are good at data wrangling, but lack the organizational authority to drive outcomes and have real autonomy.

So … what do we do?

Autonomy is earned through impact.

Roughly speaking, having impact means “driving” a set of tasks to completion that produce a positive change. Do this enough, and any reasonable organization will grant you the autonomy to execute your ideas. But how do you “drive” an outcome? Two ways:

1 — do one of the tasks yourself

2 — convince someone else to do it

Until you’ve gotten a lot of authority (via impact), it’s going to be pretty hard to convince others to implement your ideas for you. So that means you’ll have to do it yourself. And that’s where Data Science falls short. More often than not, Data Scientists (and our employers) draw boxes around the tasks we can perform, limiting us and our impact.

I hope to convince you that as Data Scientists, we can do more and that by broadening our scope, we can have greater impact and earn more autonomy.

Learned Helplessness

Many Data Scientists come from non-engineering backgrounds where, sure, we can code, but we’re not engineers. This leads to a natural box around the kind of work that we do. Commonly, this is a standard set of tools like notebooks & SQL or other data UIs. Maybe throw in some R or Python scripts for good measure.

Further, Data Scientists usually go through company-specific training for Data Science but not for engineering, and often lack fundamental knowledge of how non-DS technology (web applications, networking, APIs, etc.) works.

When our ideas require more than our current limited toolset, we throw up our hands and ask for help from engineering. Doing these tasks ourselves is too complicated. Or worse, we may look down on these tasks as “inefficient” uses of our time and opt to take on the dubious role of “thought leader”.

In this way, our organizations and we ourselves claim that we are not capable of having the impact we would like, and need to be supported by engineers instead. Sometimes then, our ideas won’t get executed — resources are valuable and scarce. And sure, you can go along with this cultural expectation. You probably won’t be criticized for it. But that kind of attitude is why Data Scientists aren’t set up for impact or true autonomy.

I’ve been in that position, and it sucks. But over time, I’ve come to see that engineering and data science are comparably complex. It’s not so scary over that wall after all. And every time I’ve rolled up my sleeves to learn something that “wasn’t my job” I’ve benefitted and so have the organizations of which I’ve been a part.

Enter the full-stack (aka autonomous) Data Scientist.

Full Stack Data Science

Let’s stop all this abstract talk and get down to brass tacks. In the sections that follow, I’ll go through some examples of what Traditional Data Scientists are expected to do today, compared with what a Full Stack Data Scientist would be able to do and what additional knowledge they’d need in order to do it. At the end I’ll go through some ways to start being more full-stack.

Running Example: re-engage users who have not made a purchase by showing them educational materials.

Data collection (logging)

Sometimes for a new project to get off the ground, we just need a bit of extra information from production systems. In our re-engagement example, we might just need to know that there are enough users coming back to make the design and engineering effort worth it. Fortunately, most companies have some sort of logging pipeline in place.

Traditional: capable of communicating clearly with engineering to make the request for logging. “When a user reaches the buy page and has no prior transactions, fire an event with parameters … ”

Full Stack: understands product codebase and data collection infrastructure well enough to add own data collection code to the product. Adds logging to production code and sends to engineering partners for review.

In other words, a Full Stack Data Scientist “owns their own logging.” To do this, they would need a general understanding of how web applications work, where the code lives, how to call existing logging libraries, and how to initiate a code review.
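
To make that concrete, here is a minimal Python sketch of the kind of change a Data Scientist might send for review. Everything named here is hypothetical: `EventLogger` stands in for whatever logging library your company already has, and `on_buy_page_view` and the event name are made up for the running example.

```python
import json
import time
from dataclasses import dataclass, field


@dataclass
class EventLogger:
    """Hypothetical stand-in for a company's existing logging library."""
    sink: list = field(default_factory=list)

    def log(self, name, **params):
        # A real pipeline would ship this record to a log collector;
        # here it is serialized and appended to an in-memory list.
        record = {"event": name, "ts": time.time(), **params}
        self.sink.append(json.dumps(record))


def on_buy_page_view(user_id, prior_transaction_count, logger):
    """Hypothetical hook called by the buy-page handler on each page view."""
    # Fire the event only for users with no prior transactions.
    if prior_transaction_count == 0:
        logger.log("buy_page_view_no_prior_purchase", user_id=user_id)


logger = EventLogger()
on_buy_page_view("u1", 0, logger)  # eligible user: an event is recorded
on_buy_page_view("u2", 3, logger)  # existing customer: nothing is recorded
```

The whole change is a handful of lines. The hard part is usually knowing where in the codebase the hook belongs, which is exactly the knowledge that code review with engineering partners helps build.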

Data processing (ETL / Big Data Jobs)

Once we have logging to tell us when users show up that are eligible to be re-engaged, we may need to regularly add additional dimensions (e.g. user country) and create roll-ups.

Traditional: communicates clearly with Data Engineering on the input log, additional dimensions, and output aggregates desired.

Full Stack: writes own data processing jobs, monitors them for success and resource consumption.

To do this, a Full Stack Data Scientist must be capable of interacting with and developing on Data Engineering tools (e.g. Airflow, MapReduce, Spark). While many traditional data scientists may have some interaction with these tools, most do not use them to their full potential. Data Engineering tools enable autonomous execution of your ideas and can automate many traditional data science tasks.
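
As a rough illustration, the core of such a job is often just a join plus an aggregation. The sketch below uses plain Python so it runs anywhere. In practice the same logic would likely live in a Spark job or a scheduled Airflow task. The function name, field names, and sample records are all invented for the example.

```python
from collections import Counter


def enrich_and_roll_up(events, user_country):
    """Add a country dimension to each event, then count events per (date, country)."""
    counts = Counter()
    for event in events:
        # Users missing from the lookup are bucketed as "unknown" rather than dropped.
        country = user_country.get(event["user_id"], "unknown")
        counts[(event["date"], country)] += 1
    return dict(counts)


# Made-up sample input: eligible-user events and a user -> country lookup.
events = [
    {"user_id": "u1", "date": "2019-02-18"},
    {"user_id": "u2", "date": "2019-02-18"},
    {"user_id": "u3", "date": "2019-02-19"},
    {"user_id": "u1", "date": "2019-02-19"},
]
user_country = {"u1": "US", "u2": "US"}

daily_rollup = enrich_and_roll_up(events, user_country)
```

Keeping unmatched users in an explicit "unknown" bucket is a deliberate choice: it makes gaps in the dimension table visible in the roll-up instead of silently shrinking the counts.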

Experimentation (AB Testing)

So now that we have some idea of how many of our users are showing up, we can justify some design & engineering effort to run a simple AB test.

Traditional: configures experiment (variants, rollout percentage, etc.) via some UI, and defines metrics to determine efficacy of treatment group.

Full Stack: integrates with AB Testing randomization mechanism and modifies existing codebase to implement the re-engagement experience.

Similar to logging, this involves a thorough understanding of the underlying technology being used including any existing AB testing infrastructure.

Of course, it will not always be possible even for a full-stack Data Scientist to implement complex experiments without also becoming a full-fledged Software Engineer. But even when they are not doing the actual engineering, a Full Stack Data Scientist will be engaged in the engineering process to understand how users interact with the code, where the experiment exposure point sits, which unit tests are included, and the implications of design choices made by engineering.
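
If you have never looked inside a randomization mechanism, here is one common approach sketched in Python: hash the user ID together with the experiment name so that each user always lands in the same variant. This is an assumption about how such a system might work, not a description of any particular AB testing platform. The names `assign_variant`, `render_buy_page`, and `reengage_education` are hypothetical.

```python
import hashlib


def assign_variant(user_id, experiment, treatment_fraction=0.5):
    """Deterministically map a user to 'treatment' or 'control'."""
    # Salting with the experiment name keeps assignments independent across experiments.
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Turn the first 32 bits of the hash into a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "treatment" if bucket < treatment_fraction else "control"


def render_buy_page(user_id, prior_transaction_count):
    """Hypothetical page handler: only eligible users are exposed to the experiment."""
    eligible = prior_transaction_count == 0
    if eligible and assign_variant(user_id, "reengage_education") == "treatment":
        return "educational_materials"
    return "default_buy_page"
```

Note where the eligibility check sits. Users with prior purchases never reach the assignment call, so they should not count as exposed when you analyze the results.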

Data products (modeling)

Taking this project to the next level, we may want to create a model that predicts whether or not a user will respond to one of several different re-engagement strategies and show them the one they are most likely to engage with.

Traditional: ingests previously captured data and produces proof-of-concept model that must be translated into production by engineering.

Full Stack: builds production model and adds appropriate monitoring to measure efficacy.

To do this, a Full Stack Data Scientist must be capable of writing production code that wraps a model, feeds it input data and interprets its output. This is fairly difficult from an engineering perspective. Some organizations will have Machine Learning Platform teams that help automate this process, but again a full-stack Data Scientist will be involved in the entire process so they fully understand what will happen in production.
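
Here is a small Python sketch of what “wrapping a model” can mean: validate the input, call the model, interpret the scores, and record every decision so efficacy can be monitored later. `StubModel` is a placeholder with hard-coded scores standing in for a real trained model. The class names, feature name, and strategy names are all hypothetical.

```python
class StubModel:
    """Placeholder for a trained model; returns hard-coded per-strategy scores."""

    def predict_scores(self, features):
        # A real model would compute these scores from the features.
        if features["sessions_last_30d"] >= 5:
            return {"tutorial_video": 0.2, "getting_started_guide": 0.7}
        return {"tutorial_video": 0.6, "getting_started_guide": 0.3}


class ReengagementService:
    """Wraps a model: checks input, interprets output, logs each decision."""

    REQUIRED_FEATURES = ("sessions_last_30d",)

    def __init__(self, model, default_strategy="getting_started_guide"):
        self.model = model
        self.default_strategy = default_strategy
        self.prediction_log = []  # stand-in for a real monitoring sink

    def choose_strategy(self, user_id, features):
        if any(features.get(name) is None for name in self.REQUIRED_FEATURES):
            # Fall back to a safe default instead of failing the page load.
            strategy, scores = self.default_strategy, {}
        else:
            scores = self.model.predict_scores(features)
            strategy = max(scores, key=scores.get)
        self.prediction_log.append(
            {"user_id": user_id, "strategy": strategy, "scores": scores}
        )
        return strategy


service = ReengagementService(StubModel())
shown = service.choose_strategy("u1", {"sessions_last_30d": 8})
```

Most of this code is not modeling at all. It is input checking, fallbacks, and logging, which is a fair picture of where the engineering difficulty tends to be.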

Incremental Progress

You might think that I’m simply advocating that all Data Scientists become engineers in addition to their day jobs. It’s true that Data Science and its associated deliverables are moving from “science” towards “technology” and may well become a sub-specialty of engineering. But we are some years off from that, so I’ll end with a few concrete steps that have helped me become more Full-Stack:

Crash engineering education: many companies offer some kind of “bootcamp” classes for engineering new-hires. Ask to be included officially, and even if that doesn’t work, just show up to the classes anyway. You may not understand everything that’s being explained right away, but the concepts will start to solidify.
Learn the infrastructure: if you work on a web application, figure out how web applications work. Build a tiny one on your laptop to make sure you understand the core concepts. Be aware that enterprise-level infrastructure can get quite complex, so give yourself a sandbox to play in before trying to understand larger systems.
Selectively go deep: once you’ve got the high-level flow of a technology, go deep on selected areas. For example, this amazing article on internet infrastructure, https://github.com/alex/what-happens-when, shows the kind of deep understanding I mean. You’ll likely want to pick and choose where you go deep so as not to be overwhelmed.
Code-review: engineering changes go through code-review. Get involved by asking to be included. Ask questions (prefix your questions with “Asking to learn” so it’s clear what you’re doing) — most people are willing to help you understand.
Make tiny changes: find a tiny feature or improvement to a system you work with and make the tiniest change you can possibly think to make. This will get you started and get you following standard contribution practices. If you can’t think of any, ask system owners for “new hire” or “bootcamp” tasks that are designed to get you rolling.
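
For the “learn the infrastructure” step, a sandbox web application can be very small. The sketch below uses only Python’s standard library (`wsgiref`). The `/buy` route and the response text are invented for the running example, and a production application would sit on a full framework and real infrastructure.

```python
from wsgiref.simple_server import make_server


def app(environ, start_response):
    """Minimal WSGI application: one known route, plain-text responses."""
    if environ.get("PATH_INFO") == "/buy":
        status, body = "200 OK", b"buy page\n"
    else:
        status, body = "404 Not Found", b"not found\n"
    headers = [("Content-Type", "text/plain"), ("Content-Length", str(len(body)))]
    start_response(status, headers)
    return [body]


def serve(port=8000):
    """Run the app on localhost until interrupted with Ctrl+C."""
    with make_server("127.0.0.1", port, app) as server:
        server.serve_forever()
```

Call `serve()` from a Python shell, then open http://127.0.0.1:8000/buy in a browser. From there, adding a log line inside `app` is a low-stakes way to practice the logging change described earlier.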

More generally, the transition from Traditional to Full Stack takes a long time (we’re all still learning!) but with a little change in perspective we can make incremental progress towards becoming more Full Stack. Instead of throwing up our hands when something’s not getting done, let’s trust that we can learn complex systems that fall a bit outside our traditional domain. And in so learning, get more work done, have more impact, and become autonomous contributors.