OpenAI, Google, and Meta Accused of Bending Rules to Feed AI Data Frenzy 

Published April 8, 2024 4:52 PM
James Morales

Key Takeaways

  • Major AI developers are running out of publicly available sources of AI training data.
  • The New York Times has reported that Google and OpenAI are transcribing YouTube videos to create training data.
  • Meanwhile, Meta reportedly changed its policy on using copyrighted material to keep up with OpenAI.

For modern AI developers, high-quality training data is becoming a scarce resource. Now, some of the largest players in the space are turning to increasingly dubious means to procure fresh supplies of human-generated text they can use to train their foundation models.

According to company insiders cited by the New York Times, both OpenAI and Google have started transcribing YouTube videos to generate new training data. If the allegations are true, the practice could open up a whole new frontier in the war between copyright owners and AI developers.

The Next AI Copyright Feud?

While the field of AI copyright law is still in its infancy, several ongoing cases pit copyright owners (including the New York Times) against companies that have used their intellectual property to feed AI models’ insatiable appetite for data.

On one side, copyright owners, including authors, publishers, artists and musicians, have argued that using their intellectual property without payment or consent amounts to copyright infringement. 

On the other, AI developers insist that training is fair use and that copyright law doesn’t prevent them from using publicly available resources. 

Now, the news that YouTube videos are being tapped as AI training data brings yet another medium into the fold, raising a fresh set of questions over what counts as fair use.

Firms Move Ahead Despite Copyright Uncertainties

Even before they turned to YouTube as a new source of training data, AI developers were being advised by lawyers that using copyrighted material could make them vulnerable to lawsuits.

Dozens of lawsuits have already been filed against Google, OpenAI and Meta, accusing them of intellectual property theft.

Nonetheless, developers have opted not to wait for the courts to establish any clear precedents on the matter, preferring to plow on gathering copyrighted data from across the internet, even if that means facing lawsuits.

According to the Times, Meta initially resisted using copyrighted material as training data but changed course after it became apparent that OpenAI was doing it.

“The only thing that’s holding us back from being as good as ChatGPT is literally just data volume,” Nick Grudin, a vice president of global partnership and content, reportedly said in one meeting. As such, Meta could follow its peer in what had become the “market precedent,” he added.

Developers Tap Torrent Sites

Perhaps the most controversial sources of training data are massive book collections downloaded from illegal file-sharing platforms like Library Genesis.

For ordinary Americans, torrenting copyright-protected books is illegal. But AI developers have embraced the vast libraries of texts available on torrent sites anyway.

For instance, Nvidia is currently being sued over its use of a data set known as Books3, which consists of 196,640 books downloaded from the Bibliotik BitTorrent tracker.

Books3 also sits at the center of a similar lawsuit filed against Meta by a group of authors including comedian Sarah Silverman and novelist Michael Chabon.
