Building Together: Why Data Projects Fail and How to Foster Collaboration

A Samuel Pottinger
5 min read · Sep 17, 2022


After a long road with engineers and scientists working hard on a new tool or model, things can still fall flat. Unfortunately, somewhere around 3 out of 4 data projects fail, with studies and articles highlighting collaboration and adoption as common challenges [1][2][3][4]. Of course, for software in general, user engagement and internal buy-in are top success criteria, with practices like user research, Agile, and participatory design all addressing these long-standing issues [5][6][7][8][9][10][11]. However, especially in AI / ML, data projects face unique challenges: model uncertainty, mechanical opacity, single authorship, and limited agency all put up barriers to engagement and feedback. Using one of my recent open source projects as a case study, what can data teams learn from other disciplines and, if collaboration matters so much, what steps can they take when working with other teams?

Challenge 1: Modeling with uncertainty

Working with and discussing uncertainty often puts up the first hurdle. Consider that some collaborators may not have deep familiarity with probabilistic systems, making it difficult to reason with tools common to data science teams, like different probability distributions or metrics.

User clicking on a button which shows an increasing counter of simulations.
Small UX gestures can get the user thinking about a population of outcomes, not one answer.

Of course, stepping back from jargon may help: contrast “80% precision” with “when predicting that something is fraudulent, it is right 80% of the time” [12]. However, more fundamentally, internalizing different probability distributions is challenging, and information design can help foster “distributional” thinking. For example, FiveThirtyEight’s “bee swarm” in their probabilistic 2022 midterms forecast encourages readers to think of election predictions as a collection of possible outcomes instead of a single answer, showing individual possibilities to 1) hint to the user that the prediction is actually a distribution and 2) reveal the shape of that population of divergent outcomes [13][14]. One of my recent open source projects (StartupOptionsBot) displays startup options in a similar way [15].

Scatterplot of months versus profit showing individual simulations as dots.
Showing individual simulation results encourages distributional thinking.

Like election forecasts, a series of complex events may change the outcome for a startup employee’s options, the part of their compensation which lets them purchase shares in their company (disclaimer: not financial advice) [16]. Employees could make or lose money and, once more, the “population shape” of outcomes for this investment often proves more informative than a single average answer. Therefore, StartupOptionsBot reveals each of its Monte Carlo trials in a scatterplot to sketch out an overall population of potential outcomes. In short, displaying individual simulations’ results pushes users to think distributionally.
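To make this concrete, below is a minimal Monte Carlo sketch in Python. The exit probability, valuation distribution, and strike price are illustrative placeholders rather than StartupOptionsBot’s actual model; the point is that reporting quantiles and the share of losing trials (or plotting every trial as a dot) says more than a single expected value.

```python
import numpy as np

# Minimal Monte Carlo sketch: simulate many possible futures for a
# hypothetical options grant instead of reporting one expected value.
# All parameters below are illustrative placeholders.
rng = np.random.default_rng(42)
NUM_TRIALS = 10_000

SHARES = 10_000
STRIKE_PRICE = 2.0           # dollars per share
EXIT_PROBABILITY = 0.3       # chance the company exits at all
MEAN_LOG_SHARE_VALUE = 1.0   # lognormal parameters for exit share price
SIGMA_LOG_SHARE_VALUE = 1.2

def simulate_one_trial():
    """Return profit (possibly negative) for a single simulated future."""
    if rng.random() > EXIT_PROBABILITY:
        return 0.0  # no exit: options end up worth nothing
    share_value = rng.lognormal(MEAN_LOG_SHARE_VALUE, SIGMA_LOG_SHARE_VALUE)
    return SHARES * (share_value - STRIKE_PRICE)

profits = np.array([simulate_one_trial() for _ in range(NUM_TRIALS)])

# The "population shape" is the point: report quantiles, not just the mean.
print(f"mean:   {profits.mean():,.0f}")
print(f"median: {np.median(profits):,.0f}")
print(f"10th / 90th percentile: {np.percentile(profits, 10):,.0f} / "
      f"{np.percentile(profits, 90):,.0f}")
print(f"share of trials at or below zero: {(profits <= 0).mean():.0%}")
```

Plotting each element of `profits` as its own dot, rather than summarizing, is what produces a scatterplot like the one above.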

Challenge 2: Mechanical opacity

In addition to addressing uncertainty, note that the complex mechanics of a machine learning / probabilistic model often prove difficult to describe in a succinct, human-understandable way. Therefore, open up access to tools like LIME which can estimate the importance of different attributes to “explain” an individual prediction [17]. Furthermore, this information can be surfaced using interface design patterns for “contextual” explanations, offering insight without disrupting user flow [18].
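As an illustration, the snippet below shows how LIME’s tabular explainer can attach a per-prediction explanation; the dataset and random forest are stand-ins chosen only to make the sketch self-contained.

```python
# Sketch of attaching a per-prediction explanation with LIME; the dataset
# and classifier here are stand-ins, not any specific production model.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from lime.lime_tabular import LimeTabularExplainer

data = load_breast_cancer()
model = RandomForestClassifier(random_state=0).fit(data.data, data.target)

explainer = LimeTabularExplainer(
    data.data,
    feature_names=data.feature_names,
    class_names=data.target_names,
    mode="classification"
)

# Explain one individual prediction in terms of its most influential features.
explanation = explainer.explain_instance(
    data.data[0],
    model.predict_proba,
    num_features=5
)
for feature, weight in explanation.as_list():
    print(f"{feature}: {weight:+.3f}")
```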

Dialog showing individual events within an individual simulation along with its result.
Allowing users to access explanations for individual results builds trust and understanding.

As a demonstration, consider how StartupOptionsBot attaches a click listener to the dots representing individual simulations [15]. While still leaving an uncluttered view of the overall distribution, this lets the user access the individual events that led to a result, providing intuition for which scenarios lead to which outcomes. Indeed, FiveThirtyEight’s presidential forecast does something analogous with individual simulation-level electoral maps [14]. In short, transparency into mechanics helps build model confidence, and prediction explanations can foster understanding of a problem.
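The sketch below mirrors that interaction pattern with matplotlib pick events. StartupOptionsBot itself is a web interface, so this is only an approximation of the idea, and the simulated data and event logs are placeholders.

```python
# Sketch of the "click a dot to see its story" pattern using matplotlib
# pick events; the simulation records here are placeholder text.
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
months = rng.integers(12, 96, size=200)
profits = rng.normal(50_000, 80_000, size=200)
# One event log per simulated trial (placeholder content).
event_logs = [f"trial {i}: simulated events leading to exit at month {m}"
              for i, m in enumerate(months)]

fig, ax = plt.subplots()
ax.scatter(months, profits, picker=True)
ax.set_xlabel("months")
ax.set_ylabel("profit")

def on_pick(event):
    """Show the underlying events for whichever dot the user clicked."""
    for index in event.ind:
        print(event_logs[index])

fig.canvas.mpl_connect("pick_event", on_pick)
plt.show()
```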

Challenge 3: Single authorship

After addressing uncertainty and opacity, consider that techniques like participatory / co-design aim to bring users into the design / development process as collaborators [19]. How might this work for data projects when real-world systems often require the skills of a data scientist or engineer to interact with a model or pipeline? Solutions like the climate change En-ROADS simulation allow users to “play” with model parameters to see how different choices change warming outcomes, while products like Grid or Observable allow making parts of a model interactive [20][21][22]. These create safe environments for productive exploration while preserving the underlying mechanics and safeguards of a model.
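For a notebook-sized version of the same idea, the sketch below uses ipywidgets to expose a toy model’s parameters behind sliders; the warming formula is a deliberately simplistic placeholder, not En-ROADS’ actual model.

```python
# Minimal notebook sketch of letting collaborators "play" with model
# parameters; the warming formula is a toy placeholder, not En-ROADS.
from ipywidgets import interact, FloatSlider

def projected_warming(carbon_price=0.0, renewable_subsidy=0.0):
    """Toy model: projected warming falls as policy levers increase."""
    baseline_warming_c = 3.6
    reduction = 0.01 * carbon_price + 0.5 * renewable_subsidy
    warming = max(1.5, baseline_warming_c - reduction)
    print(f"Projected warming by 2100: {warming:.1f} °C")

# Sliders keep exploration inside safe bounds while the model's internal
# mechanics stay untouched.
interact(
    projected_warming,
    carbon_price=FloatSlider(min=0, max=100, step=5, value=0),
    renewable_subsidy=FloatSlider(min=0, max=2, step=0.1, value=0),
)
```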

Animation showing an interface with collapsible sections.
Progressive disclosure can help the user see the main loop of a tool.

As a case study, consider the interaction design of StartupOptionsBot’s UI for building simulations [15]. First, notice that it uses “progressive disclosure” to group different controls behind collapsible sections so the user can ease into system complexity, starting with an approachable overview of model structure before showing details [23]. Second, it borrows from game design’s “core loop” concept, using highlighted buttons and color to direct the user through the cycle of changing a model parameter and seeing the result [24]. Altogether, these design choices can create iterative experimentation sandboxes that offer an environment for safe exploration.

Challenge 4: Limited agency

Of course, there’s labor in navigating these sophisticated interfaces. While these techniques allow users to manipulate model parameters, the complexity of UIs like En-ROADS and StartupOptionsBot means that, at some point, heavy usage becomes akin to programming: achieving similar complexity to code-based programming, just in a graphical interface [15][20][25]. However, language design might help!

Code editor showing a domain specific language.
Domain specific languages provide a streamlined path for broad authorship.

For example, StartupOptionsBot provides a small “domain specific” programming language. Narrowly focused on a specific environment or project, these “DSLs” may prove easier than a general purpose language (like Python) for a task and enable collaborators’ true co-authorship in a system [26][27]. Indeed, as often found in comparisons of visual versus code-based programming, the compactness of the code representation in StartupOptionsBot often proves more workable than the UI equivalent for large simulations [15][25]. Though it may seem counter-intuitive to move from UI to code for usability, consider if your data tools already require a kind of “visual programming” and which methods provide deeper shared authorship / agency.
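To show what a simulation DSL can look like, here is a hypothetical miniature language and interpreter; it illustrates the concept only and is not StartupOptionsBot’s actual grammar.

```python
# Hypothetical miniature DSL for describing simulation events; an
# illustration of the idea, not StartupOptionsBot's actual language.
import random

PROGRAM = """
ipo 0.10 share_price 25
acquired 0.25 share_price 8
fail 0.65 share_price 0
"""

def parse(program):
    """Parse lines of 'event probability share_price value' into dicts."""
    events = []
    for line in program.strip().splitlines():
        name, probability, _, share_price = line.split()
        events.append({
            "name": name,
            "probability": float(probability),
            "share_price": float(share_price),
        })
    return events

def simulate(events, shares=10_000, strike=2.0):
    """Sample one outcome from the parsed event distribution."""
    draw = random.random()
    cumulative = 0.0
    for event in events:
        cumulative += event["probability"]
        if draw <= cumulative:
            profit = shares * max(event["share_price"] - strike, 0)
            return event["name"], profit
    return "fail", 0.0

events = parse(PROGRAM)
print(simulate(events))
```

A few lines of text like this can specify a simulation that would otherwise take many clicks to assemble in a graphical builder.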

Inviting collaborators from outside data teams can help build stakeholder support and incorporate new expertise into modeling, fostering adoption and success. Data projects present unique challenges, but these techniques may pave the way for collaboration and co-authorship. All that said, I want to end by acknowledging that this takes resources: data visualization, software engineering, and infrastructure. Still, if having these skills available reduces the risk of failure, can your data team afford not to have them?

Like this and want more ideas at the intersection of design and data? Follow me! Also, see slides from a related talk.
