By the time most children are two years old, they can understand about 300 words. By the age of four, the average vocabulary has ballooned to more than 1,000 words. Our species’ incredible capacity to quickly acquire words isn’t fully understood. Some cognitive scientists and linguists have theorized that people are born with built-in expectations and logical constraints that make this possible. Now, however, machine-learning research is showing that a learner may not need preprogrammed assumptions to pick up word meanings swiftly from minimal data.
A team of cognitive and computer scientists has successfully trained a basic artificial intelligence model to match images to words using just 61 hours of naturalistic footage and sound—previously captured from the perspective of a child named Sam in 2013 and 2014. The study, published on Thursday in Science, used video and transcribed audio recorded by a head-mounted camera that was placed on Sam intermittently when he was six to 25 months old. Although it’s a small slice of a child’s life, it was apparently enough to prompt the AI to figure out what certain nouns mean.
The findings suggest that the recipe for language acquisition could be simpler than previously thought. Maybe children “don’t need a custom-built, fancy-pants language-specific mechanism” to efficiently grasp word meanings, says Jessica Sullivan, an associate professor of psychology at Skidmore College. Sullivan studies language development and was not involved in the new research, though she and others produced the video dataset that was used in the work. “This is a really beautiful study,” she says, because it offers evidence that simple information from a child’s worldview is rich enough to kick-start pattern recognition and word comprehension.
The new study also demonstrates that it’s possible for machines to learn similarly to the way that humans do. Large language models such as GPT-4, the latest version of the AI that underlies ChatGPT, are trained on enormous amounts of data that can include billions, and sometimes trillions, of word combinations. Humans get by on orders of magnitude less information, says the paper’s lead author, Wai Keen Vong, a computational cognitive researcher at New York University. With the right type of data, that gap between machine and human learning could narrow dramatically.
Brenden Lake, senior author of the study and an associate professor of psychology and data science at N.Y.U., agrees. “Today’s models don’t need as much input as they’re getting in order to make meaningful generalizations,” Lake says. “We showed, for the first time, that you can train an AI model to learn words through the eyes and ears of a single child.”
Lake, Vong and their colleagues started with a generic, multimodal machine-learning model made up of a vision encoder and a text encoder. Together, the two neural networks mapped images and written language into the same mathematical space, allowing the AI to relate one to the other. The researchers fed their model 61 hours of Sam’s head-camera footage in the form of still frames, paired with transcribed text from the accompanying audio. Because the camera simply recorded what Sam saw and heard, the dataset was messy and somewhat random: it contained instances of caregivers speaking directly to the child as well as background conversations between other people, and the audio snippets often didn’t directly describe the scenes or objects in view. Still, both Sam and the AI model managed to glean word meanings.
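That description matches the general recipe for two-tower vision-language models. As a rough illustration, here is a minimal sketch in PyTorch, assuming a CLIP-style contrastive objective; the encoders, sizes and names below are hypothetical stand-ins, not the authors’ exact architecture.

```python
# A minimal two-tower sketch, assuming a CLIP-style contrastive objective.
# All architecture details and names are simplified illustrations, not the
# study's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTowerModel(nn.Module):
    def __init__(self, vocab_size=10_000, embed_dim=256):
        super().__init__()
        # Vision tower: a tiny CNN standing in for the study's image encoder.
        self.vision = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        # Text tower: mean-pooled word embeddings over a transcribed utterance.
        self.word_emb = nn.Embedding(vocab_size, embed_dim, padding_idx=0)

    def encode_image(self, frames):            # frames: (B, 3, H, W)
        return F.normalize(self.vision(frames), dim=-1)

    def encode_text(self, token_ids):          # token_ids: (B, T), 0 = padding
        mask = (token_ids != 0).unsqueeze(-1).float()
        pooled = (self.word_emb(token_ids) * mask).sum(1) / mask.sum(1).clamp(min=1)
        return F.normalize(pooled, dim=-1)

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # Every frame is compared with every utterance in the batch; the diagonal
    # holds the true (co-occurring) pairs, which the loss pushes to score highest.
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(len(logits), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```

Nothing in this setup is language-specific: the model simply learns which frames and which utterances tend to occur together, which is exactly the simplicity the study’s authors emphasize.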
Across multiple tests, the model correctly matched many words with corresponding images. It also came close to the accuracy of two other AI models, both trained on vastly more language data. In one assessment, the scientists presented their basic model with batches of four images from the training set and asked it to point out which one contained a specific object, such as a ball. The AI was accurate about 62 percent of the time, well above the 25 percent expected from random guessing. The researchers also tested their model on new images of objects that did not appear in Sam’s recordings, and it correctly identified many of those objects anyway, demonstrating an ability to generalize what it had learned. “We were quite surprised by that,” Vong says.
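For readers curious how such a forced-choice test can be scored, here is a hypothetical sketch that reuses the TwoTowerModel illustration above; the trial format and function name are assumptions for illustration, not the paper’s evaluation code.

```python
# Hypothetical scoring loop for the four-image forced-choice test described
# above. A model guessing at random would average 25 percent.
import torch

def forced_choice_accuracy(model, trials):
    """trials: iterable of (word_token_ids, four_frames, correct_index)."""
    correct = 0
    total = 0
    for token_ids, frames, answer in trials:
        with torch.no_grad():
            txt = model.encode_text(token_ids.unsqueeze(0))   # (1, D)
            imgs = model.encode_image(frames)                 # (4, D)
        pick = (imgs @ txt.squeeze(0)).argmax().item()        # most similar frame
        correct += int(pick == answer)
        total += 1
    return correct / total
```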
The study builds on past research in machine learning and human cognition. Previous AI studies have used data from multiple children to train models, and past developmental psychology experiments have evaluated individual children’s experiences, says Linda Smith, a psychology and brain science professor at Indiana University Bloomington. Although Sam’s dataset has also been used in other studies, Smith says the new work is “a real contribution” to science.
Sullivan agrees. “I was one of the people who thought that the problem of learning language is infinitely complex and that it wouldn’t be possible to learn a word’s meaning without having some specific machinery built into your mind,” she says. But this study has swayed her. “Now I see that, in at least one case, it is possible.”
Yet there are important limitations to what the new research reveals. For one, the scientists acknowledge that their findings don’t prove how children acquire words; the study only indicates what’s feasible for a machine, and what might also be feasible for a human. Although “it’s an elegant demonstration,” it’s not sufficient evidence of what’s happening when a child learns language, Smith says. Factors beyond simple pattern recognition likely still contribute to human learning, she adds. And though the model picked up dozens of words, there were many more it failed to understand. It was very good at identifying “sand” and “car,” for example, but performed at or below chance on “hand” and “room.” Lake points out that these quirks don’t align with the kinds of words children learn most quickly, which suggests the model has nonhuman idiosyncrasies.
Moreover, the study focused only on recognizing nouns for physical objects. Human language learning is much more complex than that, says Eva Portelance, a computational linguistics researcher at Mila–Quebec Artificial Intelligence Institute. Language also involves verbs, structure and abstract concepts that children begin to grasp early on from just their own experience, and the new research didn’t demonstrate that an AI can do the same with such limited training data.
Still, it’s a step toward a deeper understanding of our own mind, Portelance says—which can ultimately help us improve human education. She notes that AI research doesn’t have to be only about maximizing bot capability and corporate profit; it can also bring clarity to long-unanswered questions about ourselves. “We can use these models in a good way: to benefit science and society,” Portelance adds.