Designing Data Science Tools at Spotify
Spotify operates at a massive scale: We have millions of listeners whose activities generate huge amounts of raw data. Raw data by itself is not that helpful though; we need to be able to process, manage, and distill it into insights that can inform new features or improvements to the experience. And to do that, we need usable, well-designed tools that ensure these insights can be easily understood.
Up until recently, the tools Spotify’s data scientists used every day were designed mostly by engineers. No one was dedicated to looking holistically at the problems data scientists were experiencing. This meant that a lot of the time, the tools were strung together with inefficient, hacky workarounds.
Over the past year, a design team was created to rethink the existing stack and weed out these bad practices.
I’m a product designer in the R&D Community at Spotify, and I’ve been working in the data tools space for about a year, which makes me one of the longest-serving designers on the team. I was brought in to pair up with engineering squads working on platforms and experiences for data scientists. Most recently, I helped to create and launch a new data science tool that would expedite insights production and eliminate those old, inefficient ways of working.
Hierarchy of needs
Before I get into the nitty-gritty of how we designed the new data science tool, it helps to understand how data scientists transform raw data into usable insights.
In her post for Hacker Noon, Monica Rogati explains The AI Hierarchy of Needs. This is the idea that there are many steps between getting data and using it for business.
First, data scientists need to collect the right data.
Then they need to process that data.
Only when it has been processed can it be analyzed and explored.
Existing landscape
When we started thinking about how to do this at Spotify, there were already some tools in use:
BigQuery
This is a Google data warehouse product with a web user interface where data scientists can store and process data. They can write queries here to make sure they have the right dataset for their question.
Jupyter Notebooks
Often simply called “notebooks,” these are an open-source interactive workspace for running code in blocks mixed with prose.
After data scientists use the BigQuery UI to validate their dataset, they use local notebooks to find insights, create visualizations that explain the findings, and share their work (among other tasks).
ScienceBox
This is an internal Spotify command-line interface tool that helps speed up the way data scientists use notebooks. It’s commonly used to organize files into projects, pre-install data science libraries, and create a standardized and reproducible data analysis workflow.
These tools worked well for small datasets. As data scientists were expected to work with bigger and bigger datasets, however, they had to wait longer to see the results of their code. If we expect data scientists at Spotify to find high-impact insights from the huge amounts of data we collect every day, they need tools that help them create high-quality insights at high speed.
Our design challenge
By the time I got involved in the project, the basic framework for the plan had already been established. ScienceBox was going to be rebuilt with a UI in the cloud, allowing us to unlock cloud computing benefits such as scalability, high-speed processing, and infrastructure flexibility.
We hypothesized that by improving this tool with added processing power, better discoverability, and cloud infrastructure, we could help data scientists analyze data more efficiently, improve collaboration, and reduce the time it takes to find insights in data.
To start off, I caught up on all the research conducted so far and mapped out the existing workflow so we could clearly see the changes we needed to make. With the help of a visual workflow, we saw that we could group the work into two main types: “ad hoc querying” (i.e., quickly querying data to find immediate answers) and “long-term investigations” (structured projects with month-long timelines).
I then segmented our users into targeted groups so we could make user-informed design decisions for the workflows we identified. Below is a sample flow for a data scientist running an ad hoc analysis.
What we learned
In rebuilding a crucial tool like ScienceBox in the cloud, we learned three important lessons that informed our approach and ultimately led to a more effective tool. They were:
- Quick actions make life easier
- Design for discoverability
- Highly variable workflows are normal
Quick actions make life easier
When we started this project, we watched many data scientists use existing analysis tools to learn how these tools were used. We learned a lot about the limitations of these tools in the context of their work. For example, we learned that if data scientists were conducting analysis locally, it could take up all of their laptops’ computational resources. This meant they couldn’t use their laptops for other tasks, and they often ended up running their queries overnight. It also meant that analysis running throughout the day could involve a lot of waiting.
Our decision to create a cloud product would allow data scientists to add processing power by running code in the cloud rather than on their laptops. They would use virtual machines (VMs), which emulate a separate computer system, to reduce the time it takes to run code. These VMs range from standard types (standard speeds) to large types (extra-high speed and memory). With these VM types, data scientists could free up their laptops for other tasks, run multiple jobs at once, and run each job faster.
There was one catch: Our internal interface would mainly function as a launchpad into JupyterLab notebooks (the next generation of notebooks), a web-based interactive development environment that opens in a new browser tab.
All the data analysis and processing work would take place in these notebooks, but since each one needed a virtual machine to power it, every data scientist had to use our separate internal tool at the start of each project to add those resources.
Our challenge was to design a product that enabled data scientists to access notebooks as quickly as possible.
Initially, we thought that data scientists would choose a notebook project before deciding what size the VM powering that notebook should be (the bigger the virtual machine, the higher the speed and memory capacity).
This reasoning meant that the VM controls were considered a secondary action, which we hid in a slide-out side panel. However, after user testing, the team and I learned that this hypothesis was incorrect. Controlling the VM was actually one of the main needs in the ScienceBox Cloud UI, so it needed to be front and center.
To solve this, we iterated based on the user testing results and added a VM control as a “quick action” in line with the notebook project name.
This worked better than expected. As a workflow shortcut, it allowed users to immediately jump in and work without thinking about how to administer their VMs. Additionally, the status of the machine served as a quick way to sort active projects, so that users could visualize which projects they were working on.
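The idea of using machine status to order a project list can be sketched in a few lines of Python. This is only an illustration, not the ScienceBox Cloud implementation: the `VmStatus` states, the `NotebookProject` type, and the sample project names are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class VmStatus(Enum):
    """Hypothetical VM states; lower values sort first."""
    RUNNING = 0
    STOPPED = 1
    NO_VM = 2


@dataclass(frozen=True)
class NotebookProject:
    name: str
    vm_status: VmStatus


def sort_by_vm_status(projects):
    """Put projects with a running VM first; break ties alphabetically."""
    return sorted(projects, key=lambda p: (p.vm_status.value, p.name.lower()))


# Made-up sample data for illustration only.
projects = [
    NotebookProject("podcast-retention", VmStatus.STOPPED),
    NotebookProject("ad-hoc-skip-rate", VmStatus.RUNNING),
    NotebookProject("onboarding-funnel", VmStatus.NO_VM),
    NotebookProject("Playlist-length", VmStatus.RUNNING),
]

for project in sort_by_vm_status(projects):
    print(f"{project.vm_status.name:<8} {project.name}")
```

Sorting on a (status, name) pair keeps active projects grouped at the top while staying predictable within each group.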
Design for discoverability
When we were researching, we learned that the many notebooks scattered across Spotify meant it was hard to locate past work.
In our solution, since we believed that data scientists needed access to only their team’s work, we first decided to limit search results to that work by auto-populating every data scientist’s account.
However, we quickly learned our initial assumption was incorrect. We found that Spotify data scientists often worked across teams and needed access to a wide variety of past work. The data owners they needed to talk to were often on different teams.
We pivoted, focusing on increasing the discoverability of notebooks to improve collaboration. That meant displaying every findable notebook in our database, enabling users to search and discover notebooks created by others in addition to their own past work. Mapping different discovery flows became important. By visualizing workflows and user journeys, I helped the team understand what changes could have the biggest impact.
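The difference between the two approaches, team-scoped results versus searching every findable notebook, can be shown with a small Python sketch. The `Notebook` fields, the `search_notebooks` function, and the sample catalog are assumptions made for illustration; the real search backend is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Notebook:
    title: str
    owner: str
    team: str


def search_notebooks(catalog, query, team=None):
    """Case-insensitive title search.

    team=None searches every findable notebook; passing a team name
    reproduces the earlier, team-scoped behavior.
    """
    needle = query.strip().lower()
    return [
        nb for nb in catalog
        if needle in nb.title.lower() and (team is None or nb.team == team)
    ]


# Made-up sample data for illustration only.
catalog = [
    Notebook("Skip rate by playlist", "ana", "growth"),
    Notebook("Skip rate after onboarding", "li", "personalization"),
    Notebook("Podcast retention", "sam", "growth"),
]

team_only = search_notebooks(catalog, "skip rate", team="growth")
everyone = search_notebooks(catalog, "skip rate")
print(len(team_only), "team-scoped result(s);", len(everyone), "across all teams")
```

In this sketch the team-scoped search misses a relevant notebook owned by another team, which is the kind of gap the broader search was meant to close.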
Highly variable workflows are normal
At first we thought it would be simple to find a typical workflow among all our users. However, we learned that, while all data scientists want to get insights from data, there are many ways to reach that goal. Some run one-off queries to test hypotheses; others are embedded in month-long projects that require complicated analysis.
Instead of forcing one flow on everyone, I designed an interface structure that was flexible enough to accommodate a dense amount of variable information, while highlighting a few primary actions.
We knew that the users were mainly visiting our platform to launch their notebook tool, so our quick actions had a large “Open” button that brought them directly to their coding environment. For notebooks without a VM, we made it easy for users to add one. For the layout, we created sortable columns and expandable drawers to empower the user to arrange the information to their liking.
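As a rough sketch of that layout logic, the Python below picks the primary action for each row and sorts rows by a user-chosen column. The row fields, the function names, and the “Add VM” label are hypothetical; only the “Open” button comes from the design described above.

```python
def primary_action(row):
    """Label for the main button: open the notebook, or add a VM first."""
    return "Open" if row["has_vm"] else "Add VM"


def sort_rows(rows, column, descending=False):
    """Sort table rows by whichever column the user clicked."""
    if any(column not in row for row in rows):
        raise KeyError(f"unknown column: {column}")
    return sorted(rows, key=lambda row: row[column], reverse=descending)


# Made-up sample rows for illustration only.
rows = [
    {"name": "skip-rate", "owner": "ana", "has_vm": True},
    {"name": "churn-model", "owner": "li", "has_vm": False},
]

for row in sort_rows(rows, "name"):
    print(row["name"], "->", primary_action(row))
```

Keeping the action choice separate from the sort order means users can rearrange the table freely without changing what the main button does.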
Conclusion
This is one of the most exciting products I’ve designed at Spotify. While it was challenging to create a product that served our many discovered use cases, it was also highly rewarding.
Firstly, ScienceBox Cloud has become highly successful. By having designers dedicated to creating a better experience for data scientists, we eliminated those old, inefficient practices and enabled data scientists to run their code up to 50% faster than before.
Secondly, throughout the process of prototyping, testing, iterating and building ScienceBox Cloud, I’ve had a lot of assumptions challenged and re-formed.
At first, I didn’t realize that notebooks could take hours to run and take over an entire laptop’s computational resources; I now have a much deeper understanding of how a query can impact analysis time. Additionally, I thought data scientists all had very similar ways of working; I now understand their workflow is highly dependent on the type of problem they’re solving. I’ve learned so much about how data scientists collect, process, understand, and analyze data to create insights that drive Spotify decision-making.
Finally, now that we’ve set up this great foundation, I think we can go so much further with notebooks. We have many more questions to answer. For example: Can all types of users who need notebooks easily use ScienceBox Cloud? How much faster can we enable our data scientists to work? How else can we help Spotifiers work more efficiently?
All that for tomorrow!