
Are we headed for an AI data shortage?

Posted February 16, 2023

Artificial intelligence (AI) could be heading for a data deficit in the near future, forcing organizations to rethink the kind of data they use and how they use it.

According to a paper from Epoch, a research organization focused on investigating and forecasting the development of advanced AI, the amount of high-quality, unlabeled language data available on the internet — in this case referring to things like books, academic papers, Wikipedia and news articles — will "almost surely be exhausted before 2027 if current trends continue."

The stock of low-quality language and image data will be close behind, running out within the next 40 years, according to researchers. "Our work suggests that the current trend of ever-growing ML models that rely on enormous datasets might slow down if data efficiency is not drastically improved or new sources of data become available," write the authors.

The recent leap in machine learning has been built around large language models (LLMs) and often emphasizes larger models and more parameters. Efforts like DeepMind's Chinchilla, released in March 2022, have challenged this approach, arguing that machine learning innovation should be driven by larger, high-quality datasets rather than by ever more parameters. Still, access to data is a driving factor in AI innovation, and neither approach is immune to an AI data shortage.

Organizations leveraging machine learning need to understand how a potential data shortage could affect them in the short and long term, and what can be done to offset it.

What a data shortage means

For the past four years, improvement in language models has been driven by training them on more data. By Epoch's estimates, there are between 4.6 trillion and 17.2 trillion tokens available for training models. To put it in perspective, Chinchilla was trained on 1.4 trillion tokens. "Without further room to scale datasets, this would lead to a slowdown in AI progress," write the researchers at DeepMind.
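To make the comparison above concrete, here is a minimal sketch of the arithmetic, using only the figures quoted in this article. The function name `headroom` is made up for illustration, and the "doublings" framing is an assumption added here, not a measure taken from Epoch's paper.

```python
import math

# Figures quoted in the article: Epoch's estimated stock and Chinchilla's training set.
STOCK_LOW_TOKENS = 4.6e12
STOCK_HIGH_TOKENS = 17.2e12
CHINCHILLA_TOKENS = 1.4e12


def headroom(stock_tokens, dataset_tokens):
    """Return (ratio, doublings) between a data stock and a dataset size."""
    ratio = stock_tokens / dataset_tokens
    # Number of times the dataset could double before reaching the stock.
    return ratio, math.log2(ratio)


low_ratio, low_doublings = headroom(STOCK_LOW_TOKENS, CHINCHILLA_TOKENS)
high_ratio, high_doublings = headroom(STOCK_HIGH_TOKENS, CHINCHILLA_TOKENS)
print(f"Low estimate:  {low_ratio:.1f}x Chinchilla, {low_doublings:.1f} doublings")
print(f"High estimate: {high_ratio:.1f}x Chinchilla, {high_doublings:.1f} doublings")
```

On these numbers, the estimated stock is roughly 3 to 12 times the size of Chinchilla's training set, which leaves room for fewer than four doublings in dataset size.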

Michael Hedderich, a researcher in machine learning and natural language processing at Cornell University, points out that access to labeled data (like ImageNet, a large dataset of annotated photos intended for computer vision research) and more efficient ways of processing it have helped fuel advancements in deep learning methods over the past few years.

More recently, he says, new approaches to processing unlabeled data have given rise to models like BERT (Bidirectional Encoder Representations from Transformers) and GPT-3 (Generative Pre-trained Transformer 3). Hedderich says it's likely a data deficit would help spur further innovation in how we gather and use information to train machine learning (ML) models.

"Based on the historic insight that once a certain limit is reached, like the availability of labeled data, new information sources are found, like focusing on unlabeled data," Hedderich says. "I could imagine that for the future, we will see a similar pattern for AI." He points out that researchers are already experimenting with sources for AI systems that are more complex than just images or texts, "like grounding them in real-life interactions or using reinforcement learning to (let) them explore complex environments."

But in the meantime, a lack of high-quality data could prove to be a major roadblock.

A question of quality

As datasets for training language models are scaled, quality control will become critical. "Larger datasets will require extra care to ensure train-test set overlap is properly accounted for, both in the language modeling loss, but also with downstream tasks," write the authors of DeepMind's paper on Chinchilla.
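One simple way to look for train-test overlap is to check how many word n-grams in an evaluation text also appear in the training data. The sketch below illustrates that general idea only. It is not the method used by DeepMind, and the function names, the sample texts and the choice of n are all assumptions made for the example.

```python
def word_ngrams(text, n=8):
    """Return the set of lowercase word n-grams in a text."""
    words = text.lower().split()
    # Texts shorter than n words yield an empty set.
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_fraction(test_text, train_ngrams, n=8):
    """Fraction of the test text's n-grams that also occur in the training data."""
    test_ngrams = word_ngrams(test_text, n)
    if not test_ngrams:
        return 0.0
    return len(test_ngrams & train_ngrams) / len(test_ngrams)


# Hypothetical training documents, for illustration only.
train_docs = [
    "the quick brown fox jumps over the lazy dog near the river bank",
    "large datasets scraped from the web need careful review",
]
train_ngrams = set()
for doc in train_docs:
    train_ngrams |= word_ngrams(doc, n=4)

leaked = "the quick brown fox jumps over the lazy dog"
clean = "a completely different sentence about model evaluation"
print(overlap_fraction(leaked, train_ngrams, n=4))  # 1.0: every 4-gram was seen in training
print(overlap_fraction(clean, train_ngrams, n=4))   # 0.0: no 4-gram was seen in training
```

A high fraction suggests the evaluation text may have leaked into the training set. At web scale, exact set intersection like this would typically give way to hashing or approximate matching.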

There are also ethical and privacy concerns with training on trillions of tokens. "Large datasets scraped from the web will contain toxic language, biases and private information," according to the DeepMind team. "With even larger datasets being used, the quantity, if not the frequency, of such information increases, which makes dataset introspection all the more important."

As data demands grow, organizations using LLM-powered technologies will need to weigh the use of lower-quality datasets from user-generated content on social media and forums against the risks of bias and toxicity.

However, Hedderich says he's a bit hesitant about the terms "high- and low-quality data."

He points out that while social media posts and other random texts on the internet might often be of lower quality when measured using metrics similar to English class grading, they also represent ways in which humans communicate. "They can, therefore, also be an important source of information for the AI to learn from," says Hedderich. "After all, humans might also interact with the AI in casual forms and not in the way a Wikipedia article or a peer-reviewed article is written." It's worth noting that using a more varied range of training data could help make AI models more diverse and reduce bias.

Quality will continue to be a challenge in the short- and long-term pursuit of better and more powerful AI models.

Keep the data pipeline flowing

The researchers at Epoch acknowledge that their prediction of a data shortage is contingent on our current trajectory. A societal or economic shift, like the large-scale adoption of autonomous vehicles, would generate an "unprecedented" amount of road video recordings. "Similarly, actors with big budgets (such as governments or large corporations) might be able to increase the production of data with enough spending, especially in the case of high-quality data for niche domains," write the authors.

New data sources like synthetic data (i.e., data that's artificially manufactured using computer models) could become critical. And more robust automatic quality metrics capable of mining high-quality data from low-quality sources could help meet data demands for AI training models.
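As a rough illustration of what an automatic quality metric might look like, the sketch below scores texts with a toy heuristic and keeps those above a threshold. The heuristic, the function names, the sample texts and the threshold are all assumptions made for this example. It is not a metric proposed by Epoch, and real filters are usually far more sophisticated.

```python
PUNCTUATION = ".,;:!?\"'()"


def quality_score(text):
    """Toy score in [0, 1]: favors longer texts made mostly of alphabetic words."""
    words = text.split()
    if not words:
        return 0.0
    # Share of words that are purely alphabetic once edge punctuation is stripped.
    alpha_ratio = sum(w.strip(PUNCTUATION).isalpha() for w in words) / len(words)
    # Texts under 10 words are scaled down; 10 is an arbitrary cut-off.
    length_factor = min(len(words) / 10, 1.0)
    return alpha_ratio * length_factor


def filter_corpus(texts, threshold=0.5):
    """Keep only the texts whose toy quality score meets the threshold."""
    return [t for t in texts if quality_score(t) >= threshold]


# Hypothetical samples, for illustration only.
samples = [
    "Researchers estimate that the stock of high-quality text is limited and growing slowly.",
    "click here!!! >>> $$$ 100% free",
    "ok",
]
for kept in filter_corpus(samples):
    print(kept)
```

A filter this crude also discards short, casual messages, which is exactly the kind of trade-off Hedderich raises when he questions the terms "high- and low-quality data."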

Organizations looking to use AI systems across a range of applications — from advanced smart products and more accurate search, to expanded speech recognition and more human-like bot interactions — could benefit from finding partners that understand the potential impact of a data shortage.

But it goes deeper than just anticipating the future. Organizations will want to look for partners that have the infrastructure in place to make sure the data they are using to train their models is high quality, annotated by a diverse team and filtered to limit toxic and biased inputs. TELUS Digital, for example, can collect and/or create diverse and representative datasets by harnessing the intelligence, skills and cultural knowledge of our global AI Community of contributors.

As it currently stands, it's easy to see our data stocks aren't infinite. But given the leap ML has made in the last few years, continued success in AI innovation isn't entirely dependent on resources; indeed, resourcefulness matters, too.

