Sending GenAI Into the Wild

MIT’s Sinan Aral describes experiments with GenAI in large-scale applications to measure real-world productivity, human collaboration and trust

MIT IDE
MIT Initiative on the Digital Economy
May 23, 2024



By Paula Klein

It’s time for real-time AI.

To keep pace with the speed of GenAI advancements, tests are moving beyond laboratory experiments to see how smart and reliable AI can be in real-world applications. These ‘road tests’ can also help businesses estimate how much productivity to expect from their AI investments — even as the technology is being perfected on the fly.

At the MIT Initiative on the Digital Economy (IDE), several efforts are underway to measure AI productivity, human collaboration and trust. IDE Director and MIT Sloan Professor Sinan Aral told attendees at this week’s 2024 Annual Conference that “we really want to get a handle on applied GenAI and what it means for business.”

Aral, who also heads the Generative AI and Decentralization Group at the IDE, said he “applauds” academic and industry research to understand the productivity effects of GenAI. At the same time, “the rubber meets the road in large-scale field experiments where you’re actually using GenAI in the wild; where anything can go wrong and anything can go right.”

Toward that end, Aral described two current GenAI experiments taking shape in his research group.

One project is an AI platform called MindMeld, developed by MIT Sloan Postdoctoral Associate Harang Ju and PhD student Michael Caosun. The online platform pairs users with large language models (LLMs) to collaborate on any number of tasks. The humans work either with GenAI assistants or with other human beings, Aral said, “and we measure how they collaborate differently — human-to-human or bot-to-human, and how their productivity differs on the task as it unfolds.”

For instance, AI and human partners created advertising copy for business ads that will appear in actual media. “We’ll be measuring how many ads are produced to get a quantitative measure of productivity, as well as the quality of the ads based on how they perform on click-through and view-through rates in the real world,” Aral said.
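To make that productivity yardstick concrete, here is a minimal sketch of how click-through and view-through rates might be computed for ads produced under each pairing. The field names and numbers are hypothetical; this is not the IDE’s actual measurement pipeline.

```python
# Hypothetical sketch: ad-quality metrics of the kind described above.
# Field names and figures are illustrative, not the IDE's actual data.
from dataclasses import dataclass

@dataclass
class AdStats:
    impressions: int  # times the ad was served
    clicks: int       # click-throughs on the ad
    views: int        # completed views (e.g., video watched past a threshold)

def click_through_rate(ad: AdStats) -> float:
    """Clicks per impression: a standard real-world quality signal."""
    return ad.clicks / ad.impressions if ad.impressions else 0.0

def view_through_rate(ad: AdStats) -> float:
    """Completed views per impression."""
    return ad.views / ad.impressions if ad.impressions else 0.0

# Example: compare one ad from a human-AI pair with one from a human-human pair.
human_ai = AdStats(impressions=10_000, clicks=240, views=1_800)
human_human = AdStats(impressions=10_000, clicks=190, views=1_500)
print(f"human-AI:    CTR={click_through_rate(human_ai):.2%}, VTR={view_through_rate(human_ai):.2%}")
print(f"human-human: CTR={click_through_rate(human_human):.2%}, VTR={view_through_rate(human_human):.2%}")
```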

The goal is to have a cache of “very rich data about how people collaborate with bots or humans and compare the outcomes.” Another test on the platform will use GenAI to assist human negotiators and then measure collaboration results.

In Search We Trust

Aral also described a new project that examines when humans trust generative AI, and in particular generative search, which is becoming increasingly common.

It’s not uncommon to get a GenAI result when searching the web. The results, generated by an LLM, sometimes include links to other references; sometimes there is feedback from other users on whether the results were helpful. In the IDE field experiment, results may be flagged to highlight certainty: a sentence from the AI may indicate that it is less sure, or very sure, of its output.
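As a rough illustration of that experimental condition, a certainty flag might be attached to a generated answer along these lines. The labels, thresholds, and confidence score here are assumptions made for the sketch; real systems derive and display confidence differently.

```python
# Hypothetical sketch: prefixing an AI answer with a plain-language
# certainty marker. Thresholds and the confidence score are assumptions.
def flag_certainty(answer: str, confidence: float) -> str:
    """Return the answer prefixed with a certainty label."""
    if confidence >= 0.8:
        label = "The AI is very sure of this answer."
    elif confidence >= 0.5:
        label = "The AI is moderately sure of this answer."
    else:
        label = "The AI is less sure of this answer."
    return f"[{label}] {answer}"

print(flag_certainty("Polls close at 8 p.m. local time.", confidence=0.42))
# [The AI is less sure of this answer.] Polls close at 8 p.m. local time.
```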

Aral noted that the nature of search queries varies greatly, of course. Some are silly or mundane, but humans also type very consequential questions into search engines, such as policy information that might help them decide how to vote in the next election, or safety tips for child care.

The key questions Aral’s team studied are: Can we trust GenAI, should we trust it, and when do we trust it?

In the test, nearly 5,000 users were randomly shown either generative or conventional search results for 50,000 online queries, then asked how much they trusted the results and how willing they were to share that information.
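A minimal sketch of that kind of randomized design follows; the arm names, rating scale, and helper functions are hypothetical, not the study’s actual code.

```python
# Hypothetical sketch: randomize each user into one search arm, then
# record their trust rating and willingness to share. Illustrative only.
import random

ARMS = ("generative", "conventional")

def assign_arm(user_id: int, seed: int = 42) -> str:
    """Deterministically randomize a user into one experimental arm."""
    rng = random.Random(seed * 1_000_003 + user_id)
    return rng.choice(ARMS)

def record_response(user_id: int, trust_rating: int, would_share: bool) -> dict:
    """Bundle one survey response with the user's assigned arm."""
    return {
        "user_id": user_id,
        "arm": assign_arm(user_id),
        "trust_rating": trust_rating,  # e.g., 1 (no trust) to 7 (full trust)
        "would_share": would_share,
    }

# Example: simulate responses and compare average trust by arm.
responses = [record_response(uid, random.randint(1, 7), random.random() < 0.5)
             for uid in range(5_000)]
for arm in ARMS:
    ratings = [r["trust_rating"] for r in responses if r["arm"] == arm]
    print(f"{arm}: n={len(ratings)}, mean trust={sum(ratings) / len(ratings):.2f}")
```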

Based on preliminary survey results, Aral reported that people generally trust generative results less than traditional search results when told which responses are generated by AI.

This holds true, he noted, even when traditional results and GenAI results are identical, word for word, just arranged differently.

“All the information is identical. So it shouldn’t be any more or less trustworthy,” he said. But when told that the information is generated by AI, “users tend to trust it less, on average.” Also, those with lower education levels were less likely to trust AI, while those who work in tech or have greater experience with GenAI were more likely to trust it.

Adding citations or references isn’t always the solution, Aral noted. While people tended to trust AI more when references were included, those additions may not be accurate.

Apparently, “the veneer of rigor is more important than rigor itself,” giving people a sense of trust in AI “even when it’s not warranted.”

Overall, Aral noted that there are clear pros and cons to generative information. On one hand, it’s adaptable and flexible because it can be queried multiple times. It’s very specific and responsive, and potentially it can be very rigorous because it’s gathering a lot of diverse information.

On the other hand, it is potentially a source of misinformation. Aral acknowledged that GenAI “hallucinates,” meaning that sometimes it makes references to research and scientific papers that aren’t real (also known as bunk rigor). The information “looks like it’s authoritative, but it’s really not,” he said.

Taken together, the IDE’s new GenAI field studies offer real-world lessons about the technology’s strengths, such as collaboration and productivity gains, as well as its weaknesses, at a time when advancements race ahead.
