Find the User in Data Science

By Zoe Padgett and Eytan Davidovits

This is a repost from the Data Science Experience Blog.


When the IBM Design team began researching data scientists, we had a lot to learn, but we found that our two disciplines had a lot in common.

Without connecting people to data, it’s just a bunch of stuff

The practice of data science is amazing and complex. A solo data scientist has to form a relevant hypothesis, find a corresponding data set, clean it, and repeatedly build and edit a model to prove or disprove that hypothesis.

The Data Science Experience grew from our attempts to understand data science as outsiders: as designers wanting to build a tool for data scientists. We were curious how data scientists distill something interesting from inchoate data. This curiosity catapulted us into a months-long research endeavor. We synthesized research conducted in our studios all over the world and had conversations with every data scientist we could find. This included hundreds of interviews, dozens of contextual inquiries, and the production of countless research artifacts. We were astounded by the practice we uncovered, and inspired by its creativity. We came to understand data science as storytelling — an act of cutting away the meaningless, and finding humanity in a series of digits.

The data science process is an experiment: the adding and subtracting of elements to find just the right mix. It’s a fluid dance of trial and error, give and take, push and pull. We realized that the tools data scientists currently use are not designed to support this fluid process of constant refinement, because the tools operate in isolation. Data scientists constantly have to navigate away from their workspaces in order to advance and edit their product. This disconnection is where we found our opportunity.

Finding our principles

Current tools address only single facets of data science, which means data scientists must toggle back and forth between research and development. Data Shaper is for cleaning data, Jupyter is for modeling, and Matplotlib is for visualizing. These tools are designed to serve a linear process, but a data scientist’s process is not linear; it’s cyclical.

Research artifact depicting the cyclical process of data science

From this model, our first design principle emerged: A holistic approach to enable data scientists. As we discussed before, much of our research involved contextual inquiries. We watched a data scientist build a pipeline — sourcing assets from the web, comparing his code to others’, and constantly jumping from tool to tool. We loved this part of the research, as it helped us understand that each facet of the process requires unique research.

Notes on contextual inquiry during pipeline construction

We saw him use dozens of assets of many different types. We watched him organize and name them. At any given point, he needed a tutorial, an academic paper, or a data set to move to the next step in his process, and each of these assets had to be saved and interacted with in a different environment. The process he used to manage his resources helped us establish a tentative system for artifact classification.

It was also enlightening to watch him browse for resources. Whether he was scrolling through lists in databases or scanning forums for code, he had criteria for assessing the value of these artifacts. We watched him pull code from several different projects and seek advice on API implementation from a forum.

It became obvious that a data science project can’t just stand on its own. It needs support and validation from the community. An artifact, whether code snippet, API, or academic paper, is only as strong as the people who use it. The more an artifact is employed, the more people there are to discuss it. The public use of an artifact sharpens its quality. The value of an asset is determined by the discussion around it — its documentation, its versioning, and its critics.

The evolution of data science is fueled by the collaborative process of building on each other’s work. This understanding led us to our second, and arguably most inspiring, principle: Community first. The community is the strongest tool a data scientist can access. So why hasn’t it been factored into any of their current interfaces?

Turning principles into practice

We wanted to create an interface that was open and dynamic, just like the modeling process we observed. We determined that our concept must allow data scientists to converse, learn, and research in the context of their software. We knew our design had to operate as a toolbox that was more dynamic than just a collection of software applications. In addition to providing data scientists with the full scope of software products they need to complete their process, we needed to address their need to validate and advance their work through research.

This helped us design one of our first concepts: the maker palette. This feature developed from the idea that the community is a tool — just as important as a notebook or data set. The design treatment is just the same as any other resource — it appears in a panel that can be opened and closed at will. The benefit is that it’s not specific to a file format or tool, so it can be accessed in any part of the interface.

A user test with the maker palette

In the community palette, a data scientist can find data sets, access papers, view tutorials, and compare their code to others’. When they’re uninspired or stuck, the community acts as peer, tool, and teacher.

Mixed content

The practice of data science centers on building a pipeline, which is a sequence of algorithms that process and learn from data. As we watched data scientists build their pipelines in notebooks, we likened the process to building a wall around a garden, brick by brick. Each brick must be tested to see if it fits with the bricks that preceded it. These bricks, collected piecemeal throughout the process, slowly enclose the desired pieces of data. The implementation of these bricks requires supplemental materials, like documentation and user testimonials. While these materials will not be included in the pipeline, they need to be viewed in the context of the code. Although they manifest as different file types, these materials are also building blocks, and are just as necessary to the advancement of a project as an actual line of code.
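For readers who think in code, here is a minimal sketch of the brick-by-brick idea. It is our own illustration, not code from the product or from any data scientist we observed. The `Pipeline` class, its `add` method, and the three example steps are all hypothetical names, and "fits" is assumed to mean only that a step runs without error on the output of the steps before it and returns a non-empty list.

```python
from typing import Callable, List

# Hypothetical type: a "brick" is any function that takes a list of rows
# and returns a list of rows.
Step = Callable[[list], list]


class Pipeline:
    """An ordered sequence of steps, built one brick at a time."""

    def __init__(self, sample: list) -> None:
        self.sample = list(sample)  # small data sample used to test each new brick
        self.steps: List[Step] = []

    def run(self, data: list) -> list:
        """Pass the data through every step already in the pipeline, in order."""
        result = list(data)
        for step in self.steps:
            result = step(result)
        return result

    def add(self, step: Step) -> bool:
        """Try a new brick on the output of the bricks already laid.

        The step is kept only if it runs without error and returns a
        non-empty list (an assumed, deliberately simple notion of "fits").
        """
        try:
            output = step(self.run(self.sample))
        except Exception:
            return False
        if not isinstance(output, list) or not output:
            return False
        self.steps.append(step)
        return True


def drop_missing(rows: list) -> list:
    """Cleaning step: remove missing values."""
    return [r for r in rows if r is not None]


def scale_to_unit(rows: list) -> list:
    """Modeling-prep step: divide every value by the largest one."""
    top = max(rows)
    return [r / top for r in rows]


def broken_step(rows: list) -> list:
    """A brick that does not fit: it always raises ZeroDivisionError."""
    return [r / 0 for r in rows]


pipeline = Pipeline(sample=[4.0, None, 2.0, 8.0])
added = [pipeline.add(s) for s in (drop_missing, broken_step, scale_to_unit)]
result = pipeline.run([10.0, None, 5.0])
```

Here the broken step is rejected while the other two are kept, so the finished pipeline holds only the bricks that passed their test against what came before.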

The brick-building metaphor inspired the form of our design. We translated the modularity of pipeline construction into a card design paradigm for the interface. Having a uniform treatment for a variety of content types allowed us to streamline the search for resources. A key component of our maker palette was the ability to display mixed content in a single environment. The data scientist can search for any type of asset inside their workspace, then review and reference it in one cohesive environment.

The design of our cards was shaped by repeated user testing.

The card-in-panel format gives the data scientist the ability to quickly test a variety of assets in their work. They can make off-the-cuff adjustments without having to make time commitments to deep research or additional tools. They can repeatedly complete the cycles of their work — ask, build, test, refine — in one unified experience.

In data scientists, we see ourselves

At IBM Design, we often discuss “the loop,” or the practice of continuously refining an idea through research and testing. As in the scientific method, we form a hypothesis, develop prototypes, test them, make observations, and adjust. As software designers, we’re constantly trying to find the storyline in “stuff.” Much like data scientists, we sift through the extraneous to find the human elements in products and processes. At the beginning, data science seemed complex and distant, and now, after all our research and a little self-reflection, it seems strangely familiar.


Visit datascience.ibm.com to learn more.