This Computer Scientist Seeks a Future Where AI Development Values Copyright

The new nonprofit Fairly Trained certifies artificial intelligence models that are trained only on licensed copyrighted data, which often isn’t the case across the industry

Using a powerful text- or image-generating artificial intelligence can feel like witnessing the mythical birth of Athena as she strides, fully formed and dressed in armor, out of Zeus’ forehead. Write a short prompt, and an instant later, lucid paragraphs or realistic images appear on the screen (joined, possibly soon, by convincing video). Those first impressions can be electrifying, as though your computer had been touched by a thunder god’s spark.

But there is another version of the Greek tale that paints Zeus less as a creator and more as a regurgitator. He eats his pregnant wife, Metis, who has done the labor of carrying Athena and has also forged her armor. Athena bursts forth only after Metis gives birth inside Zeus’ mind. Generative AI systems can’t produce anything unless they, too, feed on what already exists. First they atomize sentences, artwork and other content made by humans, and then they make connections among those digested bits. To learn how to generate text, for instance, OpenAI’s GPT-3.5, which powers the free version of the company’s popular ChatGPT, was trained on some 300 billion words scraped from Wikipedia and other websites.
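
To make that mechanism concrete, here is a deliberately tiny sketch, in Python, of the statistical idea at work. It is not how GPT-3.5 or any production model is built (those learn billions of neural-network weights rather than raw counts), but it illustrates the same dependence on existing material: the made-up corpus below stands in for the human-written text a real model digests, and nothing can be generated that was not first drawn from it.

```python
# A toy "train on existing text, then generate" loop. Real LLMs use neural
# networks trained on hundreds of billions of tokens; this sketch uses
# simple bigram counts over an invented corpus to show the same dependence
# on preexisting human writing.
from collections import Counter, defaultdict
import random

corpus = (
    "the owl of athena flies at dusk . "
    "the owl watches the city at dusk ."
)

# "Atomize" the human-written text into tokens (here, whole words).
tokens = corpus.split()

# Record the "connections among digested bits": how often each token
# follows each other token.
follows = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    follows[prev][nxt] += 1

def generate(start: str, length: int = 8) -> str:
    """Generate text by repeatedly sampling a statistically likely next token."""
    out = [start]
    for _ in range(length):
        options = follows.get(out[-1])
        if not options:  # the model can produce nothing it never ingested
            break
        words, counts = zip(*options.items())
        out.append(random.choices(words, weights=counts)[0])
    return " ".join(out)

print(generate("the"))  # e.g., "the owl of athena flies at dusk ."
```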

Several AI companies have argued that it is fair to train models this way without consulting or paying writers, photographers or other human creators. “AI development is an acceptable, transformative and socially beneficial use of existing content that is protected by fair use,” wrote Stability AI, which makes the popular image generator Stable Diffusion, in an October 2023 statement to the U.S. Copyright Office. A representative of the company told Scientific American that this remains Stability AI’s position.


This fair-use view is far from universal. Disagreement with it is the basis for disputes such as the New York Times’ lawsuit against Microsoft and OpenAI, which alleges that the technology companies unlawfully used the newspaper’s stories to make chatbots. The issue also motivated computer scientist Ed Newton-Rex to quit his job at Stability AI last November. He has since launched a nonprofit organization called Fairly Trained, which certifies companies that train their generative AI models on copyrighted material only when they have licensed it.

“There is this divide emerging between companies that train, as I would say, fairly and those that don’t,” Newton-Rex says. But it can be difficult to discern how AI models were developed, he adds. Within a given company, for instance, developers of an AI audio system might seek licenses while their colleagues behind a text-generating large language model (LLM) might not.

Fairly Trained aims to make those distinctions clearer. Models from nine companies have obtained its annual certification so far, including models from Bria, an Israel-based AI firm that recently received $24 million in Series A funding. (The company says its image-making models are trained only on pictures that are licensed from sources such as the stock photography giant Getty Images.) Scientific American spoke with Newton-Rex to understand what criteria Fairly Trained uses to certify Bria and other companies and models, and why, despite his criticisms, he remains excited by AI’s future.

[An edited transcript of the interview follows.]

Last November you wrote on X that you had left your job at Stability AI because the company hadn’t sought artists’ permission to use copyrighted material. As you left, were you thinking of making something like Fairly Trained?

I left Stability without a plan. After I’d left, there was more attention on my resigning than I anticipated. When I was talking to people, including journalists, after leaving, one of the things I was keen to point out was that the approach that Stability and many other companies take—training on work without consent—is not the approach everyone takes.... One thing that people were asking me about was: “Okay, if you say there are companies and models training in a different way, training in a fairer way, on work they’d license, where are those companies?”

I thought it would be relatively simple to put in place something that makes this more transparent. The easiest way of doing that, and the quickest way of doing that, would be to release a certification for companies who train on licensed data.

How does the Fairly Trained certification work?

Our first certification is what we’re calling the Licensed Model Certification. This can be acquired by any company that has trained a generative AI model without relying on copyrighted work that they haven’t gone and licensed or that they don’t have the rights to.

How you get the certification is by a written submission. We have a list of questions that you answer that address, primarily, two things: first, what is the training data for your model? And second, what are your internal processes that make sure you’re actually adhering to only using these training data and that your employees are using these training data...? Once we feel confident that we understand your processes, that we understand what’s going into this model and that you have the [necessary] licenses in place..., we give you the certification.

A lot of this is trust-based right now. We don’t come in and dive into your systems.... But we think that’s an adequate mechanism, for now, to make a division between the kinds of companies taking what we think is the right approach and the companies who aren’t—because, honestly, the companies that don’t take this approach are pretty open about it.

There are plenty of generative AI models out there, and some of them are made by the largest tech corporations in the world. You’ve certified nine of them, all from relatively small companies. Why is that?

I intentionally went with smaller AI companies because, in general, for this kind of thing, they can move faster. You don’t have the kind of red tape that you have at some of the bigger companies.

Having said that, clearly many of the largest generative AI companies out there today couldn’t get certified because they don’t live up to this standard.

All of the first models you’ve certified involve music, audio or images. None generate text. Is there something inherent in the process of developing an AI chatbot that makes it more difficult to certify?

I don’t know of any large language model right now that could get certified. No one has even come close to releasing a model where all of the text is licensed or public domain or under the right sort of open license. There is a school of thought among generative AI proponents that this is all fair use, and they should get as much data as they can. And that involves scraping the Internet and getting all the text they can.

Unfortunately, in the past year and a half or couple of years, that has been the way that text generation and much generative AI has gone—partly because it’s been a race to develop the biggest and best models as quickly as possible, to get as much funding as possible, to become the big player in the space. And frankly, it’s because people feel they can get away with it.

In 2022 you used an LLM to write the text for a choral piece with piano accompaniment, which you’ve called the first piece of published classical music set to generative AI text. (It was performed by the ensemble Voces8 in a London concert that was broadcast last August.) Knowing what you know now, would you do the same thing again?

I think I wouldn’t, to be honest. At the time everything was very experimental. No one really knew what was going on.

I should say: I am a proponent of generative AI. I think generative AI is a great thing. I would probably use it for one of the things that I think it’s very useful for, which is as a creative spark, as inspiration.

Describe to me the future of generative AI in which it can peacefully coexist alongside human artists and acts of creation.

There’s a big part of the world that got very excited about creating material from scratch—all of the people using Midjourney on Twitter [now known as X] to, you know, reimagine some painting in the style of Grand Theft Auto. I think that’s all a waste of time.

The exciting future is in this technology’s role as an assistive tool.... You actually can democratize creativity when you start to imagine the applications of this tech within the education system, if you can start to give people essentially personalized tutors: scalable, cheap, personalized tutors to teach them how to make things. That is especially true in fields like music, given that funding for music education has been on the decline in the U.K., where I come from.

But I think this should be done in a way that respects the creators behind the training data. Training data are one of the three key resources you need to build these systems: you need training data, you need GPUs [graphics processing units, chips that excel at running multiple computations simultaneously], and you need AI talent. People are coming in and spending millions of dollars on the latter two. I don’t see how trying to get the other key resource for free can be justified, when these systems would not work without it.

Ben Guarino is an associate technology editor at Scientific American. He writes and edits stories about artificial intelligence, robotics and our relationship with our tools. Previously, he worked as a science editor at Popular Science and a staff writer at the Washington Post, where he covered the COVID pandemic, science policy and misinformation (and also dinosaur bones and water bears). He has a degree in bioengineering from the University of Pennsylvania and a master’s degree from New York University’s Science, Health and Environmental Reporting Program.
