What I learned from NormConf 2022
Summary of selected talks and lessons learned
--
NormConf is an online tech conference about the things that matter in data and ML but rarely get the spotlight. For something that started as a Twitter joke, NormConf 2022 exceeded everyone’s expectations, with many excellent presentations from smart people sharing stories from real-life experience in the field. All talks, available as a YouTube playlist, are worth watching. This post summarizes core messages that will likely apply to anyone, regardless of role or title.
An ML Fairytale — keynote by Vicki Boykis
In the conference keynote, Vicki shares a fairytale about an ML engineer called Vectorella. Vectorella symbolizes a deeply curious person, eager to learn and work hard to solve interesting problems. She discovers, however, that the work she is asked to do often doesn’t reflect the cool ML she imagined. Instead, she must work on often tedious and repetitive tasks involving pulling data, cleaning it, scheduling jobs, and dealing with YAML and distributed clusters.
She then discovers the private staff-mle Slack channel and imagines that the people there must be doing fascinating and important work. To get in, her manager asks her to complete three tasks to prove she’s worthy of the channel. So she puts on her hat and gets to work. Finally, after a long stretch of hard work and pushing past her comfort zone, Vectorella’s dream comes true: she’s invited to the channel.
Once she joins that channel, she realizes that the Staff MLEs are doing almost the same work she did all along, perhaps just on a larger scale. But, unfortunately, even the more advanced work is not nearly as exciting as she envisioned. Shocked, Vectorella discovers that in data science, “the advertised map is not the true territory”. The true territory you deal with on a day-to-day basis involves building crontabs, writing YAML, adding things to lists, and counting them.
The moral of the story is that as data practitioners, we are all Vectorellas.
Building ML systems means building software, and that process is fragile and involves hard work, which is not nearly as glamorous as often advertised in the media. Even advanced work deals with normal problems — e.g., ChatGPT has an impressive language model in the backend, but it also involves dealing with problems such as serving web requests. You need to deal with all the pieces this work entails.
Ad astra per aspera: “To the stars through difficulties”
The upside of the story is that solid fundamentals build up to advanced work. The unglamorous basics you won’t see in the media (except at NormConf!) are, once mastered, what leads to more advanced (and potentially more fulfilling) work.
How to translate to PM speak and back — by Katie Bauer
Katie Bauer shared real-life advice on talking to people and understanding their incentives. The main takeaway is to assume good intent and remember that you and all your coworkers, including PMs, are on the same team.
Katie shared her experience building a healthy team structure that encourages knowledge sharing and good engineering practice while providing just enough business context to help people achieve their objectives. For instance, running rigorous experiments is essential, but PMs are not looking for extremely precise answers. Instead, they seek understanding and direction driven by data. So don’t lead with details. Instead, talk about how to make iterative progress toward shared objectives across the org. Show PMs how data can help them achieve a concrete outcome, e.g., accomplish an OKR, deliver a feature, or win an argument, especially when competing for resources or working on a product that isn’t (yet) successful.
A general heuristic Katie shared was to plan your work according to the product positioning:
- With a struggling product: quantify the loss and stop the bleeding
- With a successful product: find opportunities to expand
- With an emerging product: identify the product market fit (how will you acquire new customers, what are their acquisition costs, and so on)
Lost in Translation
PMs tend to focus on things they can control, so they care more about the impact of a specific change rather than randomized experiments and probabilities. They want to understand a change’s effect on a product and business. For example, if I spend X more on marketing, how will that affect revenues for that product? If we change this button to blue and move it up, are users more likely to click or stay longer on a site?
You need to carefully pick your battles:
- Keep it at a high level. Provide the big-picture overview PMs care about.
- Give examples that illustrate your point clearly — but insist on supplemental data and more accurate translations when stakes are high, e.g., when making a decision that might be hard to reverse.
Just use one big machine for model training and inference — by Josh Wills
Josh Wills shared his journey from:
- an engineer running everything for ML on one big machine,
- to “that won’t scale, it will cost too much, we do everything in K8s”,
- to today’s Josh, who once again prefers running everything on one machine, having rediscovered along the way the joy of working on complex ML problems rather than engineering around them.
Earlier in his career, Josh worked on an analytics stack built on MySQL, Perl, and R. He got quite good at operating that stack, but his manager offered him some advice: “Be careful about what you get good at.” The implication was that if he kept down that path and got even better at it, he might end up administering MySQL databases for the rest of his career. The story highlights the main point of the talk: if you get good at building large-scale distributed pipelines, you may forget that you never wanted to get good at Spark and Kubernetes; you wanted to get good at solving important ML problems.
What followed were interesting personal stories that led Josh to the following insights about why one big machine for ML is the right decision to embrace for almost everybody:
- It’s a useful heuristic for identifying important problems. If your manager asks you to work on something that seems important but balks at an investment of, say, 12 dollars per hour for a large EC2 instance on AWS, that’s a good indicator that the ML problem is not as important as it seems. Important problems deserve dedicated infrastructure, tooling, and proper resources.
- The cost of one big machine is a feature, not a bug. When you spend a lot of money on a dedicated machine for ML, it’s a confirmation that you are working on something meaningful, and you need to focus and get it to the finish line and shut down the resources when you’re done. Building cost-efficient scale-to-zero infrastructure for ML training and experimentation is often a distraction, solving around the problem. If the problem you’re solving doesn’t justify the costs of a single VM, is this work impactful enough to the business? Is this problem even worth solving?
- Choose boring technology because it lets you save your innovation tokens. Every company has a limited number of innovation tokens to spend on adopting a new framework, building a custom data store, and so on. Experimenting with new tech is easier to justify when running it on one machine, without incurring the costs of a distributed system, such as network issues or server/cluster coordination. For instance, you can still run Ray or Dask on one big machine if you want to test things out. Taking distributed systems out of the picture keeps your innovation tokens fungible.
- ML is hard on its own, there’s no need to add distributed systems to increase the complexity even further. Josh mentioned an anecdote that Stripe is still training their ML models on one machine.
- Make feedback loops fast. Troubleshooting problems on distributed systems is hard. Doing the same on a single machine can be as simple as running htop and combining it with a tail of the logs. Those two simple CLI commands give you more insight into what’s happening in your ML training process than any MLOps tool can provide, e.g., how work is distributed across cores, memory utilization per process, and why a certain process got stuck.
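As a rough sketch of that single-machine monitoring loop (the log file name here is a hypothetical placeholder for your training job’s log):

```shell
# Single-machine monitoring sketch; train.log stands in for your
# training job's log file. We create a dummy log for illustration.
printf 'epoch 1 loss=0.52\nepoch 2 loss=0.31\n' > train.log

# In one terminal: watch the training log (add -f to follow it live;
# omitted here so the command exits).
tail -n 20 train.log

# In another terminal: htop shows per-core load interactively; ps gives
# a non-interactive snapshot of processes sorted by memory usage (GNU ps).
ps aux --sort=-%mem | head -n 5
```

No agents, exporters, or dashboards to configure: the machine is right there, so the feedback loop is seconds, not minutes.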
How to pick the right instance type in the public cloud?
- Pick an instance with as much RAM as possible, ideally fitting all data you need into memory.
- Get more storage. A standard EC2 instance comes with only a basic root volume, enough to operate the VM, not the storage you need for data and intermediary results. To solve this, attach a large EBS volume to your instance.
- When you’re done with the machine, store any intermediary data on S3, take a snapshot of your EBS volume, and clean up by stopping or terminating the instance.
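The teardown routine above can be sketched with the AWS CLI. All IDs, paths, and the bucket name below are hypothetical placeholders, and `RUN` defaults to `echo aws` so the sketch prints the commands instead of executing them:

```shell
# Teardown sketch for a big-machine workflow. IDs and bucket name are
# hypothetical placeholders; set RUN=aws to execute for real.
RUN="${RUN:-echo aws}"

INSTANCE_ID="i-0123456789abcdef0"  # hypothetical instance ID
VOLUME_ID="vol-0123456789abcdef0"  # hypothetical EBS volume ID

# 1. Copy intermediary results to S3 so nothing lives only on the box.
$RUN s3 sync /data/scratch s3://example-ml-bucket/scratch/

# 2. Snapshot the EBS volume so its state survives the instance.
$RUN ec2 create-snapshot --volume-id "$VOLUME_ID" --description "post-run snapshot"

# 3. Stop (or terminate) the instance so it stops billing.
$RUN ec2 stop-instances --instance-ids "$INSTANCE_ID"
```

Stopping (rather than terminating) keeps the attached EBS volume, so you can pick up where you left off; the snapshot is the belt-and-suspenders backup.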
Finally, Josh talked about being careful not to over-automate prematurely. Instead, rely on minimum viable automation, i.e., automate only as much as needed.
Don’t do invisible work — by Chris Albon
Chris Albon has worked as a manager for a long time and shared valuable lessons on why you should make your work visible to yourself and others.
Problem:
Many individual contributors struggle with a common problem: you spend your days on recurring work, ad-hoc analysis, and helping others with urgent tasks.
Then, as time passes, you forget what you did for most of that time. Did you contribute something impactful to the business? If so, how much do you remember about it? If you don’t consciously write down what you accomplished, you will remember very little, and your boss will remember even less.
If your work is not tracked, it’s invisible. Some work is more susceptible to being invisible: communication, mentorship, ad-hoc work, and special projects, for example. Write those down especially. Promotions, performance reviews, layoffs, and bonuses are based on work people remember. If no one remembers it, it’s as if you never did it. In short, you won’t get credit for invisible and forgotten work.
Solution:
People are bad at remembering, especially without tools. A simple and practical solution is to record your work and tell people about it. You can build a lightweight system to record your work using some of these methods:
- Create a simple private activity log in your note-taking app.
- Add short notes about what you did on a given day and send those to yourself as a Slack message.
- Aim to write 2–3 lines per day in any way that makes sense to you to start a written record.
- Check out bragdocs.com for more help on this.
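The steps above can be reduced to a few lines of shell. The log path and the `log_work` helper are arbitrary choices for this sketch, not anything Chris prescribed:

```shell
# Tiny activity log: append a dated line per accomplishment.
# Log path and helper name are arbitrary choices for this sketch.
LOG="${WORKLOG:-$HOME/worklog.md}"

log_work() {
    # "--" stops printf from parsing the leading "-" as an option.
    printf -- '- %s: %s\n' "$(date +%F)" "$1" >> "$LOG"
}

log_work "Fixed the flaky retraining job and documented the root cause"
log_work "Helped the PM quantify churn impact of the pricing experiment"

tail -n 5 "$LOG"  # review recent entries before a 1:1 or performance review
```

A daily Slack message to yourself works just as well; the point is that the record exists somewhere you will actually look.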
You can use these logs to track your progress toward a specific goal or to formulate concrete arguments for a promotion.
Once you record your work, you need to tell people about it. For example, share your experience about the project you did as a blog post, (internal) documentation, or polished notes of what you accomplished based on your activity log.
As a general heuristic, formal is better than informal and written is better than verbal. Share your work via Slack, Notion, or a public blog post. Get into a habit of recording your work and telling people about it. Once you do, when someone asks you what you accomplished during the last quarter, you and that person will have a deep well of concrete examples to choose from.
Thanks, NormConf!
Thanks to all speakers, organizers, sponsors, and the broader data community. It was an insightful and valuable online conference. Let’s hope this will become the new pre-holiday tradition!