Can AI Replace Human Research Participants? These Scientists See Risks

Several recent proposals for using AI to generate research data could save time and effort but at a cost

A large group of uncolored miniature human figurines stands in a grid formation on top of a green grid graphic.

Credit: zepp1969/Getty Images

In science, studying human experiences typically requires time, money and—of course—human participants. But as large language models such as OpenAI’s GPT-4 have grown more sophisticated, some in the research community have been steadily warming to the idea that artificial intelligence could replace human participants in some scientific studies.

That’s the finding of a new preprint paper accepted for presentation in May at the Association for Computing Machinery’s Conference on Human Factors in Computing Systems (CHI), the biggest gathering in the field of human-computer interaction. The paper draws from more than a dozen published studies that test or propose using large language models (LLMs) to stand in for human research subjects or to analyze research outcomes in place of humans. But many experts worry this practice could produce scientifically shoddy results.

This new review, led by William Agnew, who studies AI ethics and computer vision at Carnegie Mellon University, cites 13 technical reports or research articles and three commercial products; all of them replace or propose replacing human participants with LLMs in studies on topics including human behavior and psychology, marketing research or AI development. In practice, study authors would pose questions meant for human participants to LLMs instead and ask the models for their “thoughts” on, or responses to, various prompts.

One preprint, which won a best paper prize at CHI last year, tested whether OpenAI’s earlier LLM GPT-3 could generate humanlike responses in a qualitative study about experiencing video games as art. The scientists asked the LLM to produce responses that could take the place of answers written by humans to questions such as “Did you ever experience a digital game as art? Think of ‘art’ in any way that makes sense to you.” Those responses were then shown to a group of participants, who judged them as more humanlike than those actually written by humans.
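
To make that setup concrete, here is a minimal sketch of what substituting an LLM for a participant might look like in code. It assumes the current OpenAI Python client and a modern chat model; the original study prompted the older GPT-3 through a different interface, and the model name and personas below are illustrative placeholders rather than details from the paper.

```python
# Minimal sketch: prompting an LLM to stand in for a human study participant.
# Assumes the OpenAI Python client (pip install openai) and an API key in the
# OPENAI_API_KEY environment variable; personas and model name are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

QUESTION = (
    "Did you ever experience a digital game as art? "
    "Think of 'art' in any way that makes sense to you."
)


def synthetic_participant_response(persona: str, question: str) -> str:
    """Ask the model to answer an interview question in character as a participant."""
    completion = client.chat.completions.create(
        model="gpt-4",  # illustrative; the study described above used GPT-3
        messages=[
            {
                "role": "system",
                "content": f"You are a study participant: {persona}. "
                           "Answer in the first person, in a few sentences.",
            },
            {"role": "user", "content": question},
        ],
    )
    return completion.choices[0].message.content or ""


# Invented personas, used only to show how a study might vary the simulated respondent.
personas = [
    "a 34-year-old graphic designer who plays narrative games casually",
    "a retired teacher who is new to video games",
]
for persona in personas:
    print(persona, "->", synthetic_participant_response(persona, QUESTION))
```

However fluent the output, each “participant” here is a statistical remix of the model’s training data rather than a person recounting an experience, which is exactly the substitution the review’s authors question.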

Such proposals often cite four main benefits of using AI to synthesize data, Agnew and his co-authors found in their new review. It could increase speed, reduce costs, avoid risks to participants and augment diversity—by simulating the experiences of vulnerable populations who otherwise might not come forward for real-world studies. But the new paper’s authors conclude that these research methods would conflict with central values of research involving human participants: representing, including and understanding those being studied.

Others in the scientific community are also skeptical about AI-synthesized research data.

“I’m very wary of the idea that you can use generative AI or any other kind of automated tool to replace human participants or any other kind of real-world data,” says Matt Hodgkinson, a council member of the Committee on Publication Ethics, a U.K.-based nonprofit organization that promotes ethical academic research practices.

Hodgkinson notes that AI language models may not be as humanlike as we perceive them to be. One recent analysis that has not yet been peer-reviewed studied how scientists refer to AI in 655,000 academic articles and found the level of anthropomorphism had increased 50 percent between 2007 and 2023. But in reality, AI chatbots aren’t all that humanlike; these models are often called “stochastic parrots” that simply remix and repeat what they have learned. They lack any emotions, experiences or true understanding of what they’re asked.

In some cases, AI-generated data could be a helpful complement to data gathered from humans, says Andrew Hundt, who studies deep learning and robotics at Carnegie Mellon University. “It might be useful for some basic preliminary testing” of a research question, he adds, with the synthetic data set aside in favor of human data once a real study begins.

But Hundt says using AI to synthesize human responses likely won’t offer much benefit for social science studies—partly because the purpose of such research is to understand the unique complexities of actual humans. By their very nature, he says, AI-synthesized data cannot reveal these complexities: generative AI models are trained on vast volumes of data that are aggregated, analyzed and averaged, smoothing out exactly the individual quirks and inconsistencies such studies set out to capture.

“[AI models] provide a collection of different responses that is basically 1,000 people rolled up into one,” says Eleanor Drage, who studies AI ethics at the University of Cambridge. “They have no lived experience; they’re just an aggregator of experience.” And that aggregation of human experience can reflect deep biases within society. For example, image- and text-generating AI systems frequently perpetuate racial and gender stereotypes.

Some of the recent proposals identified in the new review also suggested that AI-generated data could be useful for studying sensitive topics such as suicide. In theory, this could avoid exposing vulnerable people to experiments that might risk provoking suicidal thoughts. But in many ways, the vulnerability of these groups amplifies the danger of using AI-generated responses to study their experiences. A large language model role-playing as a human could very well provide responses that do not represent how real humans in the group being studied would think. This could erroneously inform future treatments and policies. “I think that is so incredibly risky,” Hodgkinson says. “The fundamental [problem] is that an LLM or any other machine tool is simply not a human.”

Generative AI may already be weakening the quality of human study data, even if scientists don’t incorporate it directly into their work. That’s because many studies use Amazon’s Mechanical Turk or similar gig work websites to gather human research data. Responses gathered through Mechanical Turk are often seen as subpar because participants may rush through assigned experimental tasks to earn money rather than focusing closely on them. And there are early indications that these workers are now turning to generative AI to be more productive. In one preprint paper, researchers asked crowd workers on the site to complete a task and deduced that between 33 and 46 percent of respondents had used an LLM to generate their answers.

Because there is no scientific precedent for using AI-generated rather than human data, doing so responsibly will require careful thought and cross-field cooperation. “That means thinking with psychologists—and it means thinking with experts—rather than just having a bunch of scientists have a go themselves,” Drage says. “I think there should be guardrails on how this kind of data is created and used. And there seem to be none.”

Ideally those guardrails would include international guidelines, set by academic bodies, on what is and isn’t acceptable use of LLMs in research, as well as guidance from supranational organizations on how to treat findings derived from AI-generated data.

“If AI chatbots are used haphazardly, it could deeply undermine the quality of scientific research and lead to policy changes and system changes based on faulty data,” Hodgkinson says. “The absolute, fundamental bottom line is the researchers need to validate things properly and not be fooled by simulated data—[or think] that it’s in some way a substitute for real data.”