Research Summaries Written by AI Fool Scientists

Scientists cannot always differentiate between research abstracts generated by the AI ChatGPT and those written by humans

By Holly Else & Nature magazine

An artificial-intelligence (AI) chatbot can write such convincing fake research-paper abstracts that scientists are often unable to spot them, according to a preprint posted on the bioRxiv server in late December¹. Researchers are divided over the implications for science.

“I am very worried,” says Sandra Wachter, who studies technology and regulation at the University of Oxford, UK, and was not involved in the research. “If we’re now in a situation where the experts are not able to determine what’s true or not, we lose the middleman that we desperately need to guide us through complicated topics,” she adds.

The chatbot, ChatGPT, creates realistic and intelligent-sounding text in response to user prompts. It is a ‘large language model’, a system based on neural networks that learn to perform a task by digesting huge amounts of existing human-generated text. Software company OpenAI, based in San Francisco, California, released the tool on 30 November 2022, and it is free to use.

Since its release, researchers have been grappling with the ethical issues surrounding its use, because much of its output can be difficult to distinguish from human-written text. Scientists have published a preprint² and an editorial³ written by ChatGPT. Now, a group led by Catherine Gao at Northwestern University in Chicago, Illinois, has used ChatGPT to generate artificial research-paper abstracts to test whether scientists can spot them.

The researchers asked the chatbot to write 50 medical-research abstracts based on a selection published in JAMA, The New England Journal of Medicine, The BMJ, The Lancet and Nature Medicine. They then compared these with the original abstracts by running them through a plagiarism detector and an AI-output detector, and they asked a group of medical researchers to spot the fabricated abstracts.

Under the radar

The ChatGPT-generated abstracts sailed through the plagiarism checker: the median originality score was 100%, which indicates that no plagiarism was detected. The AI-output detector spotted 66% of the generated abstracts. But the human reviewers didn’t do much better: they correctly identified only 68% of the generated abstracts and 86% of the genuine abstracts. They incorrectly identified 32% of the generated abstracts as being real and 14% of the genuine abstracts as being generated.
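The reviewer figures above fit together arithmetically, and a short sketch makes the relationship explicit. The two error rates are simply the complements of the two correct-identification rates. The single summary score computed here, balanced accuracy (the mean of the two per-group rates), is an illustrative addition and is not a figure reported in the article or attributed to the preprint; the variable and function names are likewise hypothetical.

```python
# Rates reported in the article, as percentages of each group of abstracts.
generated_flagged = 68   # generated abstracts correctly identified as generated
genuine_recognized = 86  # genuine abstracts correctly identified as genuine


def complement(percent):
    """Share of the same group that was misclassified."""
    return 100 - percent


def balanced_accuracy(rate_a, rate_b):
    """Mean of the two per-group rates; does not depend on group sizes.

    Assumption: this summary is added for illustration only.
    """
    return (rate_a + rate_b) / 2


missed_generated = complement(generated_flagged)   # generated taken for real
false_alarms = complement(genuine_recognized)      # genuine taken for generated
reviewer_score = balanced_accuracy(generated_flagged, genuine_recognized)

print(missed_generated, false_alarms, reviewer_score)
```

The complements reproduce the 32% and 14% error rates quoted above. Averaging the two per-group rates avoids having to assume how many generated and genuine abstracts each reviewer saw.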

“ChatGPT writes believable scientific abstracts,” say Gao and colleagues in the preprint. “The boundaries of ethical and acceptable use of large language models to help scientific writing remain to be determined.”

Wachter says that, if scientists can’t determine whether research is true, there could be “dire consequences”. Fabricated research is a problem for researchers, who could be pulled down flawed routes of investigation, and there are also “implications for society at large because scientific research plays such a huge role in our society”. For example, it could mean that research-informed policy decisions are incorrect, she adds.

But Arvind Narayanan, a computer scientist at Princeton University in New Jersey, says: “It is unlikely that any serious scientist will use ChatGPT to generate abstracts.” He adds that whether generated abstracts can be detected is “irrelevant”. “The question is whether the tool can generate an abstract that is accurate and compelling. It can’t, and so the upside of using ChatGPT is minuscule, and the downside is significant,” he says.

Irene Solaiman, who researches the social impact of AI at Hugging Face, an AI company with headquarters in New York and Paris, has fears about any reliance on large language models for scientific thinking. “These models are trained on past information and social and scientific progress can often come from thinking, or being open to thinking, differently from the past,” she adds.

The authors suggest that those evaluating scientific communications, such as research papers and conference proceedings, should put policies in place to stamp out the use of AI-generated texts. If institutions choose to allow use of the technology in certain cases, they should establish clear rules around disclosure. Earlier this month, the Fortieth International Conference on Machine Learning, a large AI conference that will be held in Honolulu, Hawaii, in July, announced that it has banned papers written by ChatGPT and other AI language tools.

Solaiman adds that in fields where fake information can endanger people’s safety, such as medicine, journals may have to take a more rigorous approach to verifying information as accurate.

Narayanan says that the solutions to these issues should not focus on the chatbot itself, “but rather the perverse incentives that lead to this behaviour, such as universities conducting hiring and promotion reviews by counting papers with no regard to their quality or impact”.

This article is reproduced with permission and was first published on January 12 2023.