AIQ (Artificial Intelligence Quotient): Helping People Get Smart about the Smart Machines They Are Using
Gary Klein and Joseph Borders, ShadowBox LLC; Robert Hoffman, Institute for Human and Machine Cognition; Shane Mueller, Michigan Technological University
Willingness to use AI tools, and the ability to use them effectively, require a reasonable understanding of what the tools do and how they work. AIQ is a set of tools designed to support calibration between the human and the system so that people can succeed in situations where it is risky to trust the system's recommendations. These include cases in which the boundary conditions cannot always be established in advance and/or where efficacy and reliability/safety data do not yet exist. AIQ offers a variety of tools and strategies for different stakeholders in an organization (e.g., developers and operators) for dealing with these settings. This essay is relevant to all these stakeholders, and particularly to the sponsors and developers of advanced AI systems.
Imagine that you have thrown a party — arranged for all the snacks and appetizers, assembled the drinks and punch bowl, remembered to put out the napkins and small plates, and invited the guests.
And imagine that no one shows up. Not a single guest. Think about the emotions you would experience as you look at all the food and beverages on the tables.
Now translate that scene into an Artificial Intelligence application, say, an advanced Machine Learning system that you and your colleagues have spent many months, if not years, developing. Time for the big launch, and no one shows up. You do have some customers who take the plunge on a contingent basis, but most of them back out and cancel their license. Again, think about the emotions you would experience.
We have observed several instances where this is precisely what happened. The prospective adopters of a system abandon it because they don’t understand and/or don’t trust it. Or else the development team is discouraged from working to provide usability support because they will be penalized for overruns in cost and schedule (Lesgold, 2012). Here is another example: A large petrochemical company that ShadowBox LLC worked with developed a sophisticated event prediction system, designed to help console operators monitor the plant and detect problems gradually forming under their radar. The problem was the number of alarms the system would generate, most of which were false or unhelpful. To add to operators’ frustration, they could not adjust the alarm sensitivity to their liking. A tool that was supposed to help operators instead turned into an annoyance, leading them to simply turn it off.
This example illustrates an important point for designers and developers. It is impossible to predict all the problems with a system before it is put into use (Woods & Dekker, 2000). The operators working with this tool needed a chance to explore it and tune the alarms to their own specific needs. The engineers didn’t allow for this in their initial design. The work needs and requirements of the prospective adopters have to be well thought out and explored. Otherwise, operators are likely to lose faith in the technologies, and it takes some time before they are willing to try working with the AI systems again.
Unfortunately, this happens all too often (Zachary et al., 2007; The Standish Group, 2014). The investors you’ve lined up are disappointed, frustrated, and angry. Or maybe you are one of those investors, one of those sponsors, and you can scarcely hide your bitter feelings.
The failures of innovative AI/ML (Artificial Intelligence/Machine Learning) systems probably far outnumber the successes. Failures occur for several reasons. This essay examines one of them: the users weren’t comfortable trusting the system. The essay also offers a set of methods for addressing this matter. But first, let’s examine the trust issue in a little more detail.
Many costly AI/ML systems are rejected by the users because of a lack of trust. Or rather, a lack of appropriate trust.
This problem isn’t new.
Appropriate Trust
“They’ll just have to learn to trust the system!” These forceful words were uttered about 30 years ago by an Air Force colonel in a high-level meeting in the Pentagon. He was frustrated that so many people — pilots, maintenance technicians, logistics specialists — didn’t want to use the Artificial Intelligence systems that had been designed to help them. The users were complaining that they didn’t know what the AI recommendations were based on, and they had no easy way to find out.
Remember — 30 years ago was the era of rule-based production programs and expert systems, which we now view as a primitive and very limited form of AI[1]. Decision makers were quite justified in being skeptical about the new technology, but the colonel had no patience for their worries.
The resistance wasn’t just coming from the Air Force. Professionals in many fields, from the military to healthcare to transportation, were pushing back against the AI systems being foisted on them.
And the Air Force colonel in this meeting had heard enough. He was a true believer. His mindset reflected the distinction between what are called “default trust” and “optional trust.” Optional trust involves reasoning about whether or not to trust; default trust means simply trusting in and relying upon the system because it is thought of as a computer or AI system (Hoffman, 2017).
The colonel’s words would resonate today with the true believers in current forms of AI, Machine Learning, Deep Neural Nets and Reinforcement Learning. These systems are far more inscrutable than the early production rule programs of the 1980s and 1990s. Today’s users face a seemingly impossible task of trying to understand how the AI/ML systems work — even their developers have to guess where the machine recommendations are coming from because the AI/ML systems are learning on their own, chewing through millions of examples and somehow synthesizing these data.
But the lack of understanding is just one part of the problem. Another part is that AI/ML programs can be brittle. They work well within well-ordered, constrained problems but become less reliable in the face of ambiguity. For example, computer vision models struggle to accurately classify ordinary objects like cars and furniture if they are rotated in unnatural ways. That is because AI/ML systems rely on associations that they have “seen” before (through a large training corpus). Furthermore, it is difficult and sometimes impossible to know the range of situations that a system “understands” if it incorporates deep learning techniques, which can make the boundaries of the training sets untraceable.
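As a concrete illustration of this kind of brittleness, here is a minimal sketch (ours, not taken from any of the systems discussed in this essay) that compares a standard pretrained image classifier's answer for a photo with its answer for the same photo rotated. It assumes torch and torchvision are installed, and car.jpg is a hypothetical local photo.

```python
# A minimal sketch of probing rotation brittleness: compare a pretrained
# ImageNet classifier's prediction for an image with its prediction for the
# same image rotated 90 degrees. "car.jpg" is a hypothetical local photo.
import torch
from torchvision.models import resnet50, ResNet50_Weights
from PIL import Image

weights = ResNet50_Weights.DEFAULT
model = resnet50(weights=weights).eval()
preprocess = weights.transforms()          # standard resize/crop/normalize
labels = weights.meta["categories"]

def top_label(image):
    """Return the classifier's top-1 label and confidence for a PIL image."""
    batch = preprocess(image).unsqueeze(0)
    with torch.no_grad():
        probs = model(batch).softmax(dim=1)[0]
    idx = int(probs.argmax())
    return labels[idx], round(float(probs[idx]), 3)

img = Image.open("car.jpg").convert("RGB")
print("upright:", top_label(img))
print("rotated:", top_label(img.rotate(90, expand=True)))
# A label flip or a sharp drop in confidence on the rotated copy is the kind
# of boundary condition the essay is describing.
```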
An additional element of appropriate trust is believing that there is a straightforward, timely, and affordable way to tune the system in response to experience with its use. Very few systems continue to be used without modification. To be comfortable managing use of a system, one has to not only know how it works, more or less, but also how it can be tuned.
The limitations of AI/ML (brittleness, difficulty with uncertainty, difficulty with modifications, and the inability to provide sufficient explanations, to name a few) make these systems less than ideal for dealing with complex and uncertain environments, especially where the stakes and risks are high. Some jobs require operators to rely upon systems even if they do not fully trust them. However, we are focusing on professions in which the system user has some autonomy and responsibility for decision making. Physicians are reluctant to follow AI/ML recommendations that they don’t understand and therefore don’t entirely trust — if something goes wrong, it is the physician who will be blamed, not the programmers.
People with experience are adept at dealing with problems in these environments (i.e., sensemaking, identifying feasible solutions, generating workarounds). That’s why skilled humans are still necessary to operate these systems, especially when risks are high and there is a small margin for error (Hoc et al., 2013).
Unlike the colonel mentioned above, we cannot and should not mandate that decision makers simply trust in and rely upon the AI/ML systems they are given. People should form appropriate trust, not unwavering and blind trust. And their trust judgments have to reflect several problems: AI/ML systems can be opaque, brittle, and bug-ridden. Decision makers must gain some understanding of the AI/ML systems to achieve appropriate trust (Riley, 1996). This involves determining the conditions under which it is appropriate to trust the AI/ML systems, and learning how to manage those systems and work around their limitations.
Of course, there are many examples of automation that we don’t have any qualms about using. We don’t want to manually control airbag deployment in our cars, despite the occasional news story about a rogue airbag that deployed unexpectedly. We have no hesitation about pressing buttons in an elevator, even though the calculations for controlling a bank of elevators are surprisingly complicated. We don’t want to investigate those calculations. In the old days, each elevator car had its own operator. Today, the occupation of elevator operator has disappeared. Our trust is often “default.”
Other jobs require more collaboration between humans and technology, and place them in an interdependent relationship (Johnson & Vera, 2019). AI/ML systems are used in some work settings to generate predictions with high stakes, such as identifying military targets and conducting cancer screening. As AI/ML systems intrude further and further into our lives, the challenge of managing them becomes more important, and more difficult.
The DARPA XAI Program
DARPA (Defense Advanced Research Projects Agency) was concerned that too many promising AI/ML systems were being rejected or under-utilized because the users were unable to understand enough about how they worked. DARPA therefore initiated a program in 2017 — Explainable Artificial Intelligence (XAI). The XAI program established 11 international teams of leading AI/ML researchers and developers to each take a shot at building explainability into AI/ML systems. These teams primarily crafted intelligent, algorithmic methods to inject explainability.
However, there was another team of researchers in the XAI program.
The 12th Team
In addition to these 11 teams, DARPA established a separate team of cognitive psychologists to explore what it means to have an effective explanation. Three of the authors of this essay (Gary Klein, Robert Hoffman, and Shane Mueller) were on that team. We are experimental psychologists, and although we further specialize as cognitive scientists, we are not computer scientists. Our approach to AI is empirical: We examined issues such as what is needed for explainability, what is needed for appropriate trust, and what users of AI systems really want.
We found that some users do indeed want to understand how the AI/ML systems work, but others don’t really care — they are more concerned with how the AI/ML systems have been trained because that affects the trustworthiness of the AI/ML systems.
We went beyond our charter. We designed a strategy that was non-algorithmic, and therefore, in our view, much more accessible to the people using AI/ML systems. This was the AIQ strategy.
AIQ — Artificial Intelligence Quotient
The acronym “AIQ” is a blend of Artificial Intelligence and Intelligence Quotient. It is about helping people get smarter, more intelligent, about the intelligent systems, the AI/ML systems that they are using. AIQ is not about helping people become more knowledgeable about AI/ML in general. Instead, it is about helping with the specific AI/ML systems with which decision makers must work.
AIQ is a set of tools that were designed to support the achievement of appropriate trust and reliance between the human and the system, so they can be successful in achieving the work’s goals. AIQ offers a variety of tools and strategies for different stakeholders in an organization (e.g., developers and operators).
The AIQ tools have the potential to achieve a number of goals. Developers can use the tools to evaluate the feasibility and cost/benefit of proposed/existing technologies. They can use the guidelines to build AI/ML-based systems that support various cognitive functions such as decision making, sensemaking, problem detection, and team coordination. The AIQ tools are designed to enhance the explainability of technology/analytics and thereby ensure an appropriate trust in the AI, improve usability, and increase performance.
The AIQ toolkit contains several different instruments. This essay describes five of the most powerful of them: the Cognitive Tutorial, ShadowBox, the Collaborative XAI (CXAI) tool, the Self-Explaining Scorecard, and the Discovery Platform. These can be used individually or in concert.
The AIQ strategy, to help users gain a better mental model of the AI/ML systems they are using, is guided by the Mental Model Matrix (Borders, Klein & Besuijen, 2019; 2022), shown in Figure 1. The Mental Model Matrix (MMM) is based on research we conducted with panel operators at a very complex petrochemical plant. The operators were continually on guard for problems and early signs of potential upsets. We found that part of their mental models was about how the systems they were operating worked. That is the upper left-hand quadrant of the MMM. Most accounts of mental models are restricted to this single quadrant, but we learned that there is much more to effective mental models. The upper right-hand quadrant is about how the systems might fail: the boundary conditions, the inherent limitations. It is not enough to know how a system works. For the sake of safety, efficiency, and effectiveness, operators must be aware of how the system can break down. What to do in the face of a breakdown? The lower left-hand quadrant shows that an effective mental model includes ideas and tactics for working around the limitations of the systems. And with any complex system operated by teams, there is the potential for confusion and losses of common ground. The lower right-hand quadrant calls this out — effective mental models need to include ways to anticipate these confusions.
Figure 1. Mental Model Matrix.
The Mental Model Matrix has helped us configure the tools in the AIQ toolkit.
The Cognitive Tutorial. This tool helps decision makers better understand not only how their AI/ML systems work but also how they stumble — their flaws and boundary conditions. It provides diagnoses for these limitations. And it helps decision makers learn some workarounds for when they encounter boundary conditions.
The Cognitive Tutorial is an experiential guide to help users understand the strengths and limitations of complex (and opaque) systems. It is a continuation of the work of Mueller and Klein (2011), who developed such an experiential user guide to assist people wrestling with complex algorithms such as Bayesian decision aids. The Cognitive Tutorial expands this work to apply to AI/ML systems. The tutorial is built around different modules that provide examples of how the system works and how it breaks down (e.g., boundary conditions), and it includes diagnoses of why the system got it wrong. The idea is to help users understand the complexities of the system and calibrate themselves to it, so they can be more effective in using their AI/ML systems to accomplish their goals.
Through the course of working with an AI system, a user can learn about how it operates under different conditions. The cognitive tutorial must be arranged so as to align with the learner’s knowledge and capabilities. Under normal conditions the user may pick up on patterns and regularities, which make the system predictable to some extent.
Over time, the operator may also gain an understanding of its boundary conditions, such as where it breaks down and is fallible. In fact, sophisticated users like to test the limits of the system before adopting it. These kinds of edge cases may not happen frequently, so they may not be planned for, which can result in irregular and unexpected deviations in how the system works. The power of the Cognitive Tutorial comes, in part, from presenting examples and exercises that reference the edge cases.
Here is a simple example of Cognitive Tutorial materials, developed by Shane Mueller in collaboration with Anne Linja (Linja et al., 2022), using social media comments about Tesla’s Full Self-Driving (FSD) capability. The Tesla FSD system works surprisingly well. Figure 2 shows screen shots of the FSD system on a two-lane highway. FSD keeps the automobile on the right at least 80% of the time.
Figure 2. FSD with lane markings.
However, what happens when there are no lane markers? Figure 3 shows screen shots of the FSD reaction. It moves to the center of the highway, regardless of oncoming traffic. (One shudders to imagine FSD adapting to snow-covered roads.) FSD keeps to the right only 20% of the time when there is no center lane. Caveat: these percentages have not been verified, and the FSD capability may have improved since the study performed by Linja, Mueller, and colleagues (2022).
Figure 3. FSD without lane markings.
A standard user manual would warn drivers that they need to be continually alert. But the visual impact of Figures 2 and 3 is stronger than a mere warning and provides drivers with a better sense of the boundary conditions for the FSD capability. In fact, Mueller’s team has collected data showing that the cognitive tutorial format results in better learning than presentation of repeated examples.
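To suggest how such materials might be organized, here is a minimal sketch (our illustration, not Mueller's actual tutorial materials) of a tutorial module as a simple data structure: paired examples of the system working and breaking down, a diagnosis, and a workaround, loosely modeled on the FSD lane-marking case above.

```python
# A minimal sketch of one Cognitive Tutorial module organized as data.
# The field names and example content are illustrative, not the actual tutorial.
from dataclasses import dataclass, field
from typing import List

@dataclass
class CaseExample:
    situation: str        # what the user sees
    system_behavior: str  # what the AI/ML system did
    acceptable: bool      # whether that behavior was acceptable

@dataclass
class TutorialModule:
    title: str
    works: List[CaseExample] = field(default_factory=list)        # normal conditions
    breaks_down: List[CaseExample] = field(default_factory=list)  # boundary conditions
    diagnosis: str = ""   # why the system fails in the edge cases
    workaround: str = ""  # what the operator can do about it

lane_module = TutorialModule(
    title="Lane keeping with and without center markings",
    works=[CaseExample("Marked two-lane highway", "Keeps to the right", True)],
    breaks_down=[CaseExample("No center line", "Drifts toward the middle of the road", False)],
    diagnosis="The system leans on painted lane markings like those in its training data.",
    workaround="Stay alert and take manual control on unmarked or snow-covered roads.",
)
print(lane_module.title, "/", lane_module.diagnosis)
```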
Additionally, we are currently using the Cognitive Tutorial in a U.S. Air Force project to increase the impact of an AI/ML-based machine translation system.
ShadowBox. ShadowBox is a method for training cognitive skills. It presents people with challenging scenarios and interrupts the action to offer different options, such as alternative courses of action, alternative goals, or alternative items of information to pursue. The trainee assesses these options and then is shown how a small panel of subject-matter experts assessed the priority of the same options. In addition, the trainees must describe the rationale for their assessments, and then they see the synthesized rationale statements of the experts. In this way, trainees get to see the world (or at least this scenario) through the eyes of experts without the experts having to be present.
Klein and Borders (2016) used a collection of ShadowBox training scenarios to enhance the social skills warfighters need for building trust with civilians in hostile environments. ShadowBox can also be used, in a variety of ways, to present AI/ML users with both straightforward and complex tasks, including tasks in which the AI/ML limitations are on display, so that trainees can come to appreciate how experts anticipate those limitations and work around them.
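To show the core mechanic, here is a minimal sketch (ours, not the ShadowBox software) in which a trainee ranks the options at a decision point and is then shown the expert panel's ranking, rationale, and a simple agreement score. The scenario, options, and scoring convention are hypothetical.

```python
# A minimal sketch of the core ShadowBox comparison: trainee ranking versus
# an expert panel's ranking at one decision point. All content is hypothetical.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class DecisionPoint:
    prompt: str
    options: List[str]
    expert_ranking: Dict[str, int]   # option -> priority, 1 = highest
    expert_rationale: str

def agreement(trainee: Dict[str, int], expert: Dict[str, int]) -> float:
    """1.0 means identical rankings; 0.0 means maximally different rankings."""
    n = len(expert)
    worst = (n * n) // 2   # worst-case sum of absolute rank differences
    diff = sum(abs(trainee[opt] - expert[opt]) for opt in expert)
    return 1.0 - diff / worst

dp = DecisionPoint(
    prompt="The classifier flags a possible target near the edge of its training range.",
    options=["Act on the recommendation", "Request another sensor pass", "Escalate to an analyst"],
    expert_ranking={"Act on the recommendation": 3,
                    "Request another sensor pass": 1,
                    "Escalate to an analyst": 2},
    expert_rationale="Experts reduce uncertainty first, before acting or escalating.",
)
trainee_ranking = {"Act on the recommendation": 2,
                   "Request another sensor pass": 1,
                   "Escalate to an analyst": 3}

print(f"Agreement with the expert panel: {agreement(trainee_ranking, dp.expert_ranking):.2f}")
print("Expert rationale:", dp.expert_rationale)
```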
Collaborative Explainable AI (CXAI). This is another tool intended to help people who rely on an AI system (Mamun et al., 2021). The CXAI tool enables people to leave notes for future users and their colleagues and engage with them to share experiences. From a developer perspective, CXAI can provide insight into the innovations and adaptations that users create, as well as issues they encounter.
The CXAI capability is analogous to collaborative community-based question-answering comment-boards (e.g., StackExchange, Quora), in which users pose questions (problems, challenges, bugs) about specific software tools or computer games and others provide answers (e.g., discovered rabbit holes, workarounds, insights, heuristics). Stored questions and answers can be queried using keywords.
CXAI uses a similar tactic, but specifically for AI systems: it enables AI developers and users to leave notes for future users and to get feedback from other users on the tips and concepts they have discovered.
Thus, although a CXAI thread may begin with a particular question or problem that a particular user is having with a software tool, it may also begin with something that a user discovered or an explanation they generated about the system.
Figure 4 illustrates the post-it messaging kludge that the CXAI capability is intended to replace.
Figure 4. Post-it messaging kludge.
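To make the mechanics concrete, here is a minimal sketch (our illustration, not the CXAI software of Mamun et al., 2021) of a shared note board for a specific AI tool, with keyword search over the stored notes. The tool name and note content are hypothetical.

```python
# A minimal sketch of a CXAI-style shared note board: users leave notes about
# a specific AI tool, tag them with keywords, and search notes left by others.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Note:
    author: str
    text: str
    keywords: List[str] = field(default_factory=list)
    replies: List[str] = field(default_factory=list)   # feedback from other users

class NoteBoard:
    def __init__(self, tool_name: str):
        self.tool_name = tool_name
        self.notes: List[Note] = []

    def post(self, author: str, text: str, keywords: List[str]) -> Note:
        note = Note(author, text, keywords)
        self.notes.append(note)
        return note

    def search(self, keyword: str) -> List[Note]:
        """Return notes whose keywords or text mention the query term."""
        kw = keyword.lower()
        return [n for n in self.notes
                if kw in (k.lower() for k in n.keywords) or kw in n.text.lower()]

board = NoteBoard("event-prediction console")   # hypothetical tool
board.post("operator_3",
           "Alarm floods during unit startup; the predictions are unreliable then.",
           keywords=["alarms", "startup"])
for note in board.search("startup"):
    print(f"{note.author}: {note.text}")
```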
The Self-Explaining Scorecard. In order to gain acceptance for their systems, and to gain the trust of the users, AI/ML developers sometimes try to provide partial explanations for how their systems think. This is the core concept of Explainable AI. Yet that strategy may be misguided because users do not just passively receive messages and prompts. In many cases, users are actively trying to fashion their own explanations. This is a missing part of the equation: the appreciation that users are sensemakers who are attempting to figure out why their AI/ML systems are making surprising recommendations (Mueller et al., 2019).
For that reason, we want to encourage AI/ML developers to support this sensemaking activity, and to help users do a better job of self-explaining.
The Self-Explaining Scorecard (Hoffman et al., in preparation; Klein et al., 2021) tries to encourage the developers to do a better job of supporting the sensemaking of their users. The Scorecard presents an ordinal scale of the types of features that can be inserted into their programs, shown in Figure 5.
The lowest level of the scale is the null category — nothing is offered at all to assist with self-explaining. Unhappily, in the published literature we have reviewed, most of the descriptive reports fall into this category.
One step up, Level 1, is to provide some potentially helpful features such as saliency maps or annotations of features. These act as scaffolding for self-explaining, although there’s not much evidence that they are effective (Vorm et al., 2021).
Level 2 of the Scorecard presents success stories, demonstrations, and diagrams describing how the AI/ML system works. This corresponds to the upper left-hand quadrant of the Mental Model Matrix in Figure 1.
Level 3 of the Scorecard presents instances of failures so that users can form an impression of the system’s limitations and boundary conditions. This corresponds to the upper right-hand quadrant of the Mental Model Matrix.
Level 4, more advanced, covers AI/ML systems that expose their decision logic and reasoning, giving users a glimpse into the factors and calculations the AI/ML systems deploy when generating their recommendations. This level lets users gain some appreciation of how the AI system is making decisions. Despite the opacity of many AI/ML systems, it is still possible to show some of the goal tradeoffs they perform.
Level 5 systems provide insights into the reasons for the failures — the diagnoses of what went wrong.
Level 6 is a jump to exploration. Here, the user can query the categories, features, concepts, and events; study contrasts; examine counterfactuals; and manipulate various aspects of the system to see the effects. Users can manipulate the inputs into the system and the weights it uses in order to “break” the system — to see what it would take for it to perform poorly.
Level 7, the highest rank in our scale, is interactive adaptation aimed at modifying the AI/ML system and, at the same time, modifying the user’s mental model of the system.
The purpose of the Self-Explaining Scorecard is to help system developers be ambitious in what they provide to users, and to encourage the sponsors who fund AI development to support this ambition.
Figure 5. The Self-Explaining Scorecard. For larger version see https://tinyurl.com/SelfExplScore.
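To picture the Scorecard as an ordinal instrument, here is a minimal sketch (our illustration, not the published Scorecard tooling) that encodes the levels described above and rates a system by the explanation support it documents. The rating convention and the example feature set are hypothetical.

```python
# A sketch of the Self-Explaining Scorecard as an ordinal scale.
# Level descriptions follow the essay; the rating helper is our own convention.
SCORECARD_LEVELS = {
    0: "Nothing offered to assist self-explaining",
    1: "Potentially helpful features (e.g., saliency maps, feature annotations)",
    2: "Success stories, demonstrations, diagrams of how the system works",
    3: "Instances of failures, limitations, boundary conditions",
    4: "Decision logic and reasoning behind recommendations",
    5: "Diagnoses of the reasons for failures",
    6: "Exploration: contrasts, counterfactuals, ability to 'break' the system",
    7: "Interactive adaptation of the system and of the user's mental model",
}

def rate_system(documented_levels):
    """Return the highest contiguous scorecard level with documented support.

    `documented_levels` is the set of levels for which a development team can
    point to concrete support. The contiguity convention is ours, not the authors'.
    """
    level = 0
    while (level + 1) in documented_levels:
        level += 1
    return level, SCORECARD_LEVELS[level]

# Example: a system offering saliency maps, demos, and failure examples,
# but no decision logic or diagnoses, rates at Level 3.
print(rate_system({1, 2, 3}))
```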
Discovery Platform. This tool was developed rather by accident. One of us (GK) was talking with Bill Ferguson, a member of the BBN/Raytheon development team during the DARPA XAI program, about the problem that designers of advanced AI/ML systems themselves don’t fully understand their system’s outputs. The systems absorb and digest hundreds of thousands, sometimes millions of examples but the tuning is invisible to the designers.
Bill was trying to figure out the kinds of images and questions his AI/ML system did well with, and the kinds it failed at. The task Bill was studying asked users to judge whether the AI/ML system was likely to correctly categorize the things depicted in a photograph. So Bill laboriously searched through his database and identified a few rules of thumb. For example, if the query was about location, users should guess that the AI/ML would get it right, because his system was very good at identifying where an action was taking place — a bedroom or kitchen or dining room. If the query was about a sport being played or an animal that was depicted, users should also bet on the system, because there are so many labeled sports and animal photographs on the internet that his team’s system had digested. Bill even demonstrated the value of these rules of thumb: he explained them to novice users of his system and found that the rules let the users do a better job of guessing whether the AI/ML system’s judgment was accurate.
That conversation sparked the idea of a Discovery Platform to help AI/ML developers make similar insights. We could use Bill’s approach, and his database, and apply it to other AI/ML systems. It seemed like such a good idea. Unhappily, it turned out to be a bad idea because the database Bill was using wasn’t designed for making discoveries. It was designed to analyze performance data. And Bill had to work much too hard to squeeze discoveries out of it.
Nevertheless, Author GK wasn’t ready to give up on the Discovery Platform concept. By examining the sources of Bill’s frustration with his database, we formed a picture of what he really needed. We formed the general image of a system that could be deployed by designers, and even by users, to deliver on the framework of the Mental Model Matrix: show how the system worked, show its limitations and boundary conditions, and offer some suggestions for ways to avoid the failures.
Bill investigated contrasts, such as the photographs the system failed at even though it usually got that category right. For example, his AI/ML system was very accurate with photographs of soccer players. So which soccer photographs did it miss? Bill studied the misses and realized that they were all about indoor soccer games. These are the kinds of contrasts Bill wanted to easily access. He also wanted to collect cases for which the system really struggled, to see if that gave him any clues.
The envisioned platform had to make it easy to study commonalities among such cases in order to infer general themes.
We wanted a system that let people spot exceptions, anomalies, and outliers, and see not just summary data but the actual images, to help designers notice things they’d been missing. Case exploration is what developers do now, but their databases are not designed with the selection and filtering of cases in mind. To apply the Discovery Platform, developers input their cases, the ground truth categories, and the AI outputs. Developers can also solicit feedback from system users by including short surveys about how the system fared in different situations. In doing this, they can see which interactions were judged successful and which were failures, which provides critical context for understanding the system and making improvements to it.
The system needed to let designers examine failures — cases that users guessed wrong — in the hope of diagnosing what the users were missing.
And Bill wanted to see actual photographs, actual instances, and not just summary data. He wanted to see thumbnails. But he also wanted to easily switch back and forth between the statistics and the thumbnails.
And that is what Author SM developed (Mueller et al., 2021).
In short, the Discovery Platform lets developers and operators learn the strengths and limitations of an AI system by examining the history of its use. It is an agnostic platform that does not need any customization. It supports exploring content features: failure cases, confusions, outliers, and contrasts (e.g., failures in classes in which the AI/ML usually performs well, and successes in classes in which it usually does poorly). It also provides a capability to shuttle between a statistical view and a representation of instances.
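As a hypothetical illustration of the kind of case history the platform works from, here is a minimal sketch of a case table holding ground truth, model output, and user feedback, along with the sort of contrast query described above. The column names and values are our own, not the Discovery Platform's actual schema.

```python
# A hypothetical case table for a Discovery-Platform-style exploration.
# Column names and values are illustrative, not the platform's actual schema.
import pandas as pd

cases = pd.DataFrame(
    [
        ("c001", "soccer",  "outdoor match", "soccer",     0.97, "fine"),
        ("c002", "soccer",  "indoor arena",  "basketball", 0.55, "missed the sport"),
        ("c003", "kitchen", "home interior", "kitchen",    0.96, None),
        ("c004", "soccer",  "outdoor match", "soccer",     0.93, None),
    ],
    columns=["case_id", "ground_truth", "context", "prediction",
             "confidence", "user_feedback"],
)
cases["correct"] = cases["prediction"] == cases["ground_truth"]

# A "contrast" query of the kind described above: failures within a category
# the system usually gets right (here, soccer photographs).
soccer = cases[cases["ground_truth"] == "soccer"]
print("soccer accuracy:", soccer["correct"].mean())
print(soccer.loc[~soccer["correct"],
                 ["case_id", "context", "prediction", "user_feedback"]])
```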
Here is an illustration of the Discovery Platform, taken from Mueller et al. (2021). Mueller had little trouble building and training an AI/ML system that could identify handwritten numerals: show it a handwritten digit and, with very high accuracy, it could identify which number it was. Figure 6 shows thumbnails of the images Mueller’s system got right for the number “4.” These are the success stories. The Discovery Platform easily collects these examples.
Figure 6. Examples of correctly identified 4s.
The Discovery Platform statistically determined the errors the system made — the “4” examples that it missed. The system found that the most common misses were to mislabel a “4” as a “9”. Why did this happen? The Discovery Platform readily pulled up a random sample of this type of error, shown in Figure 7. Judge for yourself what is going wrong.
Figure 7. Correctly identified 4s versus 4s mistaken for 9s by the classifier.
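For readers who want to try this kind of error exploration on their own classifier, here is a minimal sketch (ours, not the Discovery Platform code itself) that finds the most common confusion for the digit 4 and pulls a random sample of those misses for visual inspection. It assumes you already have arrays of test images, true labels, and predicted labels from your own model.

```python
# A minimal sketch of the error-exploration step described above: given a
# classifier's predictions on a digit test set, find the most common confusion
# for the digit 4 and display a random sample of those misses as thumbnails.
# Assumes `images` (N x 28 x 28), `y_true` (N,), and `y_pred` (N,) already
# exist for your own classifier.
from collections import Counter
import numpy as np
import matplotlib.pyplot as plt

def show_confusions(images, y_true, y_pred, digit=4, n_samples=12, seed=0):
    """Show a random sample of `digit` images mislabeled as the most common
    confusion (e.g., 4s mistaken for 9s)."""
    misses = np.where((y_true == digit) & (y_pred != digit))[0]
    if len(misses) == 0:
        print(f"No errors for digit {digit}.")
        return
    wrong_label, count = Counter(y_pred[misses]).most_common(1)[0]
    print(f"Most common miss: {digit} labeled as {wrong_label} ({count} cases)")

    rng = np.random.default_rng(seed)
    pool = misses[y_pred[misses] == wrong_label]
    sample = rng.choice(pool, size=min(n_samples, count), replace=False)

    fig, axes = plt.subplots(3, 4, figsize=(6, 5))
    for ax in axes.flat:
        ax.axis("off")
    for ax, idx in zip(axes.flat, sample):
        ax.imshow(images[idx], cmap="gray")
        ax.set_title(f"true {y_true[idx]} / pred {y_pred[idx]}", fontsize=8)
    plt.tight_layout()
    plt.show()

# show_confusions(images, y_true, y_pred)   # e.g., on an MNIST-style test set
```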
You can watch the Discovery Platform in action in this YouTube demonstration: https://www.youtube.com/watch?v=qRQb-fa0N5A.
You can access and play with the Discovery Platform system: obereed.net:3838/mnist/.
Finally, you can download the Discovery Platform software for free: https://github.com/stmueller/xai-discovery-platform.
Enjoy.
Summary
Too often, existing AI/ML-driven analytic solutions are unobservable and unexplainable from the end-user’s perspective. As such, they fail to account for how humans use them to make sense of their work, and they miss the interactions among the human, the analytics, and the environment.
Transparent and explainable decision support tools can facilitate more efficient information exchange and tasking, and more effective planning and re-planning strategies for adapting to real-time challenges. Ideally, the AIQ toolkit can support developers in building meaning, comprehension, and context into AI/ML technologies moving forward.
AIQ is a toolkit of straightforward, easy-to-understand, non-algorithmic methods for unpacking AI/ML systems and helping users gain appropriate trust in the systems they are operating.
The AIQ toolkit contains a number of tactics. This essay described five of the primary methods: the Cognitive Tutorial, ShadowBox, the Collaborative XAI (CXAI) method, the Self-Explaining Scorecard, and the Discovery Platform.
Without a support capability such as AIQ, AI/ML developers are often working in the dark, uncertain about what their programs have learned and what their strengths and limitations are.
Without a support capability such as AIQ, the people using the AI/ML systems are flying blind, concentrating on getting the system to do what it is supposed to (the upper left-hand quadrant of the Mental Model Matrix), but unsure and ignorant of the flaws and limitations and boundary conditions of the system (the upper right-hand quadrant). Without having some sense of these limitations, the users are typically unable to perform workarounds that might be necessary (the lower left-hand quadrant). And the users are further hamstrung by confusions and misapprehensions about the AI/ML system and its limitations (the lower right-hand quadrant).
We are not claiming that AIQ is the only set of methods to help developers and users, and we are not even claiming that it is the best set of methods. Our claim is that in the face of proliferating AI/ML programs and support systems, capabilities such as AIQ may be important and even necessary. Just as we have learned to become uncomfortable if we somehow fail to buckle our seatbelts when driving on expressways and even on neighborhood streets, users should learn to be uncomfortable when asked to work with AI/ML systems without being provided with AIQ-type protections. We look forward to results when researchers and practitioners develop and apply the AIQ tools, so that the suite of tools can be improved and expanded.
Finally, sponsors and AI/ML developers might consider using AIQ tools or similar tools in order to protect their investment and reduce the chances of having users reject their systems unnecessarily.
Figure 8. Summary of AIQ Tools.
Acknowledgement
The authors appreciate the very helpful guidance and suggestions of the editors.
References
Borders, J., Klein, G., & Besuijen, R. (2019). An operational account of mental models: A pilot study. International Conference on Naturalistic Decision Making, San Francisco, CA.
Borders, J., Klein, G., & Besuijen, R. (2022). Mental Model Matrix. [Manuscript submitted for publication]
Hoc, J. M., Cacciabue, P. C., & Hollnagel, E. (Eds.). (2013). Expertise and technology: Cognition & human-computer cooperation. Psychology Press.
Hoffman, R. R. (2017). A taxonomy of emergent trusting in the human-machine relationship. In P. Smith & R. R. Hoffman (Eds.), Cognitive systems engineering: The future for a changing world (pp. 137–164). Boca Raton, FL: Taylor & Francis.
Hoffman, R. R., Jalaeian, M., Tate, C., Klein, G., & Mueller, S. T. (in preparation). Evaluating machine-generated explanations: A “scorecard” method for XAI measurement science.
Johnson, M., & Vera, A. (2019). No AI is an island: The case for teaming intelligence. AI Magazine, 40(1).
Klein, G., & Borders, J. (2016). The ShadowBox approach to cognitive skills training. Journal of Cognitive Engineering and Decision Making, 10(3), 268–280.
Klein, G., Hoffman, R. R., & Mueller, S. T. (2021). Scorecard for self-explaining capabilities of AI systems. Technical Report, DARPA Explainable AI Program.
Klein, G., Hoffman, R. R., & Mueller, S. T. (2021, November 2). Scorecard for self-explaining capabilities of AI systems. https://doi.org/10.31234/osf.io/78wxn
Lesgold, A. M. (2012). Practical issues in the deployment of new training technology. In P. J. Durlach & A. M. Lesgold (Eds.), Adaptive technologies for training and education. New York: Cambridge University Press.
Linja, A., Mamun, T. I., & Mueller, S. T. (2022). When self-driving fails: Evaluating social media posts regarding problems and misconceptions about Tesla’s FSD mode. Multimodal Technologies and Interaction, 6(10), 86.
Mamun, T. I., Hoffman, R. R., & Mueller, S. T. (2021). Collaborative Explainable AI: A non-algorithmic approach to generating explanations of AI. In International Conference on Human-Computer Interaction (pp. 144–150).
Mueller, S. T., & Klein, G. (2011). Improving users’ mental models of intelligent software tools. IEEE Intelligent Systems, 26(2), 77–83.
Mueller, S. T., Hoffman, R. R., Clancey, W., Emrey, A., & Klein, G. (2019). Explanation in human-AI systems: A literature meta-review, synopsis of key ideas and publications, and bibliography for explainable AI. arXiv preprint arXiv:1902.01876.
Mueller, S. T., Klein, G., Hoffman, R. R., & Ferguson, B. (2021). The XAI Discovery Platform. Technical Report to the DARPA XAI Program, Task Area 2: Naturalistic Decision Making Foundations of Explainable AI, Sub-Task 5.3 (AIQ Toolkit).
Riley, V. (1996). Operator reliance on automation: Theory and data. In R. Parasuraman & M. Mouloua (Eds.), Automation theory and applications (pp. 19–35). Mahwah, NJ: Erlbaum.
The Standish Group. (2014). The Standish Group report: Chaos. Project Smart / The Standish Group International. https://www.standishgroup.com
Vorm, E. S., Aha, D., Karneeb, J., Floyd, M., Comps, D. J. Y., & Pazzani, M. (2021). DARPA Explainable Artificial Intelligence Program: Evaluation final report. US Naval Research Laboratory. www.nrl.navy.mil.
Woods, D., & Dekker, S. (2000). Anticipating the effects of technological change: A new era of dynamics for human factors. Theoretical Issues in Ergonomics Science, 1(3), 272–282.
Zachary, W., Hoffman, R. R., Neville, K., & Fowlkes, J. (2007). Human total cost of ownership: The penny foolish principle at work. IEEE Intelligent Systems, 22(2), 88–92.
Footnote
[1] Despite the primitive nature of early rule-based production programs, they are traceable, so developers can uncover which rules produced a given response. Neural net programs are more opaque and sometimes produce untraceable responses.