
We Could Run Out of Training Data for AI in Less Than 5 Years

We’re in the midst of a Creative AI Renaissance, with tools that can assist us in everything from generating realistic talking-head videos to writing website copy and even completing code. At this pace, AI-generated content could eclipse human-made content within the next five years.

This Renaissance relies on the premise that AI will continue to improve. The problem, though, is that AI is limited by the data we have to train it on, and that supply is dwindling.

The trouble is, the types of data typically used for training language models may be used up in the near future—as early as 2026, according to a paper by researchers from Epoch, an AI research and forecasting organization, that is yet to be peer reviewed. The issue stems from the fact that, as researchers build more powerful models with greater capabilities, they have to find ever more texts to train them on. Large language model researchers are increasingly concerned that they are going to run out of this sort of data, says Teven Le Scao, a researcher at AI company Hugging Face, who was not involved in Epoch’s work. – MIT Tech Review

The paper doesn’t specify whether this problem extends to AI models other than generative language models. Nonetheless, it’s a looming dilemma that raises a few open questions.

Can researchers use lower-quality training data, such as social media comments and posts?

Can they reuse the same training data twice or more?

Can AI be trained on data generated by AI? Or will this cause AI programs to regress?

Ultimately, we don’t want these models to hit a ceiling in just a few years. So these are big questions that could determine what happens once we reach the limit of available training data.

On the other hand, in the field of computer programming, these worries seem to be nonexistent, as teams double down on generative AI.

GitHub, which is owned by Microsoft, launched a tool called Copilot that suggests snippets of code and functions as developers type. Developers are using Copilot to generate up to 40% of their code, and GitHub expects that number to double within five years, Bloomberg reported earlier this month. – Business Insider

Google is working on numerous internal code-generating programs to assist (and/or replace) developers.

  • Its subsidiary DeepMind has a system named AlphaCode that uses AI to generate code but is currently focused on competitive programming, that is, writing programs that solve contest-style problems.
  • Google is working on a tool similar to GitHub's Copilot that uses machine learning to generate code snippet suggestions as developers type. The tool improved coding iteration times among Google employees by 6%.
  • Google’s AI Developer Assistance program goes further by training systems to do more of the work themselves.

Furthermore, an internal Google project recently came to light that aims to let code write, fix, and update itself, which could reduce the company’s reliance on software engineers:

Google is working on a secretive project that uses machine learning to train code to write, fix, and update itself. The project, which began life inside Alphabet's "X" research unit and was codenamed Pitchfork, moved into Google's Labs group this summer. By moving into Google, it signaled its increased importance to leaders.

Pitchfork was built for "teaching code to write and rewrite itself," according to internal materials seen by Insider. The tool is designed to learn programming styles and write new code based on those learnings, according to people familiar with it and patents reviewed by Insider. – Business Insider

Overall, we wouldn’t have this AI Renaissance without publicly available data. In software engineering, a field with a lot of open-source data, running out of training data doesn’t appear to be a problem. But for language generation and image generation, the future looks less bright, especially considering we’re on pace to create 2.7 trillion images with AI by 2027, which is 4x the number of images on the Internet today.