This Is How You Build a Data Workflow

Or how I learned to stop worrying and love the journey.

David Richards
The Startup
7 min read · Jul 14, 2020

--

Photo by Jan Phoenix on Unsplash

Once there were two sons of two entrepreneurs. They both decided to follow their fathers’ examples and build businesses out of ideas and relationships. One became a salesman by selling products; the other became a salesman by learning data science. I don’t know which was wiser, but I know which was surprised when he realized he was a salesman. The harder he looked for meaning in data, the more important it became to speak clearly and simply.

I was asked a question on Quora:

When did you have to analyse data and give recommendations based on your result?

Someone wants to know about the workflow of data work, no? So I tell her my version. I don’t know if my version is a better version than all the others, but I know it is a good version. It works for me. I learn from data. That’s the point, isn’t it? I sometimes convince people to move forward. That’s my job, to move people forward for good reason. Making a recommendation is easier than making a difference with a recommendation.

How do I do it? I provide at least one data answer every day, though that pace slows down when I’m doing deeper work. Since I’m currently working at a SaaS startup, my answers lead us to better relationships with customers, better use of our time and resources. A lot of the work is exploratory analysis, figuring out a generally accurate but imprecise answer.

Proportions

It’s important to know that a lot of my day leads up to the big moments. Most of it is setting up the problem, and sometimes I learn enough to leave a problem alone before I do anything fancy.

About 60–70% of my time is spent on data engineering: finding, cataloging, securing, indexing, or storing data. I work with streams, spreadsheets, and databases.

Around 20–30% of my time is spent on data analysis: exploring, visualizing, and summarizing data. Often I create baseline models to get the general shape of the data, the relationships between factors, and an idea of problems with incomplete data, outliers, or other problems.

The rest, between 0–20% of my time, is spent on machine learning. Often baseline models are good enough to make a decision. When I do machine learning, a lot of it is supervised learning, especially with categorical data. I rarely process natural language or images, and even more rarely build reinforcement learning models. That’s the mix on my current project; other projects call for other kinds of work.
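To make the idea of a "good enough" baseline concrete, here is a minimal sketch of a majority-class baseline for categorical data, in plain Python. The function name and the churn labels are invented for illustration; the point is only that any fancier model has to beat this number.

```python
from collections import Counter

def majority_baseline(labels):
    """Return the most common label and its share of the data.

    The share is the accuracy you'd get by always predicting
    the majority class: the floor a real model has to clear.
    """
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)

# Hypothetical churn labels, for illustration only.
churned = ["no", "no", "yes", "no", "no", "no", "yes", "no"]
label, accuracy = majority_baseline(churned)
print(label, accuracy)  # the baseline any model must beat
```

A baseline this simple already tells you the shape of the problem: if the majority class covers 95% of the data, a model reporting 94% accuracy is worse than doing nothing.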

The important thing is the proportion of the work. Set the expectation that you’re working with data, and you won’t be disappointed. Set the purpose to learn from data, and you’ll often be satisfied with accurate but imprecise answers. You have to learn to be satisfied with imprecise answers, because that’s the pace of life. People are waiting to use what you have. If they’re not, either you don’t really know anything at all, or you haven’t built a working relationship with decision makers yet.

Examples

A lot of companies break this work up differently.

Uber has tried pairing data engineers with machine learning experts. More commonly, they use machine learning experts as consultants, temporarily lending them to a project. The team monitors the machine learning models over time so it can bring the experts back when it’s time to upgrade their work.

Google uses pragmatism to engineer great products first, depending on machine learning only when required. They depend heavily on strong data pipelines and containerized services, and they have been organizing around containers since before containers were cool.

Microsoft seems to be winning at data work by building strong development stacks, the way they’ve always done. Microsoft knows how to build a consistent and complete environment better than any group I’ve seen during my decades in technology.

I work in startups, so I am typically more of a generalist than you’d see in larger organizations. I’ll get involved on a project and usually only handle some tools or models before it’s time to move on. Small organizations are tactical with a sometimes brutal frugality.

Tools and Practices

I almost always start my work with Jupyter notebooks. Even if I have a lot of data, will need to work on a stream, or will deploy my work in a very different environment, Jupyter is the starting block. Why? Because that’s what I know.

I started a lot of my data work in Stata and Minitab. I was thrilled when I found R. I learned how to make do with Ruby. Ruby is a wonderful language, but I was on my own for a lot of the work.

I use Python generally because it’s a general-purpose tool. The data tools are excellent. In recent years, on recent projects, we’ve built data pipelines in Elixir. We’ve transported data with Clojure and Kafka, with Spark, and with other tools. I’m pretty sure Pachyderm is a wise tool, as are Airflow, Glue, and NiFi.

The point is there are a lot of ways to move data, and it’s important that I do it reliably and transparently, expecting to replay the process when things go wrong. That means I store keys alongside my data identifying the systems that moved it, so I can roll it back when I make mistakes.
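A minimal sketch of what "storing keys with the data" can look like, wrapping each batch with lineage metadata so it can be traced and replayed. Every name here (the function, the source and mover labels) is a made-up illustration, not a real pipeline API.

```python
import json
import uuid
from datetime import datetime, timezone

def wrap_batch(records, source, mover):
    """Attach lineage keys to a batch of records.

    The batch_id, source, mover, and timestamp travel with the data,
    so a bad load can be located and rolled back or replayed later.
    """
    return {
        "batch_id": str(uuid.uuid4()),  # key stored with the data
        "source": source,               # where the data came from
        "mover": mover,                 # which system moved it
        "moved_at": datetime.now(timezone.utc).isoformat(),
        "records": records,
    }

# Hypothetical usage: tag a batch coming off an event stream.
batch = wrap_batch([{"user": 1}], source="events_stream", mover="elixir_pipeline_v2")
print(json.dumps(batch, indent=2))
```

The design choice is that the metadata lives with the records, not in a separate log you have to join against when something goes wrong at 2 a.m.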

It doesn’t matter to you what I’ve tried, but it should matter to you that I’ve tried a few things. You will too. If you can get more done with Go, Rust, or Julia, that could be exciting. You’re not looking for the best tool. You’re looking for the pragmatic one, the one that makes it easier to get things delivered.

Also, you’re not going to use most of the algorithms most of the time. Ray Dalio launched Bridgewater Associates with an HP Calculator, a composition notebook, and linear regression.

Other tricks that keep things flowing:

  • I make metrics actionable. That means I replace fuzzy ideas with crisp ones. If a metric goes up or down, people should know clearly what to do next.
  • I collect common functions. These go into a code repository so I can use them from project to project. Often I have only minutes to do hours worth of work, making this a critical habit.
  • I use consistent data visualizations. Seaborn is a very good choice for Python-based projects. Observable is a great choice for JavaScript projects. R’s ggplot2 is amazing. Vendor tools often do a good-enough job.
  • I have ready-to-use data exploration functions. This starts with Pandas on a sample of data, but I tend to build things that work well with my models, comparing factors for consistent results.
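As a sketch of a ready-to-use exploration function, here is roughly what a Pandas-based first pass can look like. This assumes Pandas is installed; the function name, sample size, and toy DataFrame are all invented for illustration.

```python
import pandas as pd

def explore(df, sample_size=1000, seed=0):
    """Quick first pass over a DataFrame: sample, summarize, flag missing data."""
    sample = df.sample(min(sample_size, len(df)), random_state=seed)
    return {
        "shape": df.shape,
        "summary": sample.describe(include="all"),
        "missing_share": df.isna().mean(),  # fraction of missing values per column
    }

# Hypothetical SaaS data, for illustration only.
df = pd.DataFrame({"plan": ["free", "pro", None, "pro"], "seats": [1, 10, 3, 8]})
report = explore(df)
print(report["missing_share"])
```

Working on a sample keeps the pass fast on large data, and the missing-value shares surface the incomplete-data problems mentioned above before any modeling starts.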

Sales

So, back to the data person who accidentally became a sales person. How does that happen?

Good data work makes good progress with people. We make decisions based on what we have. This is executive work, even when it’s done from the back corner with a laptop. If we’re not making better decisions, then what is it all for?

As with proper sales, convincing anyone to do anything takes repetition. It means starting a conversation, then having the same conversation, then listening harder so that you can have the conversation better. It’s about simplifying and speaking to the other person’s interests instead of your own.

That’s not a natural thing to do. People who are particularly motivated learn it, but most people who focus on code and data don’t even pay attention.

I spoke recently with Noah Gibbs, a Ruby fellow who does data work in that capacity. He told me that he needed to learn to sell himself before anyone took him seriously. His process was more or less:

  • Learn to find where people talk about their problems.
  • Listen.
  • Put work out that addresses people’s pains, fears, and risks.

James Patterson said that brand is a relationship with people. He’s the best-selling author (of all time, I’m told) who was an ad executive before he started to write novels. Patterson has placed books in most households in America and tells us that he does it by building a relationship with people.

When he says that, he emphasizes that these are people. Meaning, they aren’t a metric, segment, or target. It’s a person that has job pressures, a reputation to protect, and limited time to handle everything.

How many decision makers have you met that can easily stop and concentrate for more than a moment or two on a problem? When I work, I concentrate on something for as much time as I can give a problem. A decision maker tends to jump from problem to problem and can’t easily contextualize something new.

If they see the same thing a few times, learn to trust you, can talk about things with you without feeling small, then you’re building a meaningful relationship. This is your brand: a relationship people can use.

You don’t get there the first time, not usually. If you are exceptionally brilliant, that’s wonderful. That might not be enough. So you keep going. You keep sharing. You keep recommending. You keep going.

Once I was the first person hired at a startup. The seed investor and future CEO of the company had successfully started and exited a company. Now, he was toying with the idea of working with us. He came in and pitched us to us. He couldn’t stop pitching. Maybe he could, I don’t know, but he didn’t. Sometimes we’d get the same pitch three or four times in a conversation. He was measuring our reaction, refining his game. He has a master’s degree from Berkeley in computer science, but he learned to always pitch.

So, you find yourself pitching ideas, building models, and moving data. You find you’re doing a lot of things you didn’t think you were going to do. This is the workflow, from the inside out. This is playing the whole game. I don’t know which is wiser, the person that focuses well or plays the whole game well, but I know which is happier.
