The Internet Archive has long been a valuable resource for journalists, from surfacing records of deleted tweets to providing academic texts for background research. The advent of AI, however, has created a new tension between the Archive and news publishers. Several major publications have begun blocking the nonprofit digital library's access to their content, citing concerns that AI companies' bots are using the Internet Archive's collections to indirectly scrape their articles.

“A lot of these AI businesses are looking for readily available, structured databases of content,” Robert Hahn, head of business affairs and licensing for The Guardian, told Nieman Lab. “The Internet Archive’s API would have been an obvious place to plug their own machines into and suck out the IP.”

The New York Times took a similar step. “We are blocking the Internet Archive’s bot from accessing the Times because the Wayback Machine provides unfettered access to Times content — including by AI companies — without authorization,” a representative from the newspaper confirmed to Nieman Lab. Subscription-focused publication the Financial Times and social forum Reddit have also made moves to selectively block how the Internet Archive catalogs their material.
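Nieman Lab's reporting doesn't detail the mechanism each outlet used, but the conventional way to turn away a specific crawler is a robots.txt directive. The sketch below, a minimal illustration using Python's standard urllib.robotparser, assumes the "ia_archiver" user-agent token historically associated with the Wayback Machine's crawling; in practice, publishers may also enforce blocks server-side, since the Internet Archive has not always honored robots.txt.

```python
# Minimal sketch: one robots.txt rule turns away an archiving crawler
# while leaving every other bot unaffected. The "ia_archiver" token is
# an assumption; it has historically identified the Wayback Machine's
# crawler, but individual outlets may block differently.
import urllib.robotparser

ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The archiving crawler is refused site-wide...
print(parser.can_fetch("ia_archiver", "https://example.com/any-article"))  # False
# ...while other user agents can still fetch the same page.
print(parser.can_fetch("SomeOtherBot", "https://example.com/any-article"))  # True
```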

Many publishers have sued AI businesses over how those companies access content used to train large language models, including a number of outlets from the realm of journalism.

Other media outlets have sought financial deals before offering up their libraries as training material, though those arrangements seem to compensate the publishing companies rather than the writers. And that's before getting into the copyright and piracy battles that other creative fields, from fiction writers to visual artists to musicians, are waging against AI tools. The full Nieman Lab story is well worth a read for anyone who has been following these creative industries' responses to artificial intelligence.
