AI Companies Bypassing Web Standards to Scrape Publisher Sites, Says Licensing Firm

Forbes accused Perplexity of using its investigative stories in AI-generated summaries without attribution or permission.

New York: Multiple artificial intelligence companies are bypassing a standard web protocol used by publishers to block content scraping for use in generative AI systems, according to content licensing startup TollBit.

In a letter to publishers seen by Reuters on Friday, TollBit said AI companies are sidestepping the Robots Exclusion Protocol, which sites implement through a robots.txt file. The letter arrives amid a public dispute between AI search startup Perplexity and Forbes over the same issue, and it highlights broader tensions between technology firms and media outlets over content usage in the era of generative AI.

Forbes accused Perplexity of using its investigative stories in AI-generated summaries without attribution or permission. A recent Wired investigation found that Perplexity was likely bypassing efforts to block its web crawler via robots.txt.

Perplexity declined to comment on the matter, Reuters reported.

TollBit, an early-stage startup, positions itself as a mediator between AI firms hungry for content and publishers interested in negotiating licensing deals. The company uses analytics to track AI traffic on publishers’ sites, facilitating agreements on fees for various types of content usage.

According to TollBit’s letter, the issue extends beyond Perplexity, with “numerous” AI agents disregarding robots.txt directives to access content.

“This pattern emerges as we analyze more publisher logs,” TollBit stated. The robots.txt protocol, established in the mid-1990s, helps prevent websites from being overloaded by web crawlers, though compliance has historically been voluntary.
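
For readers unfamiliar with the mechanism, the following is a minimal sketch of how a compliant crawler might consult robots.txt before fetching a page, using Python's standard `urllib.robotparser` module. The robots.txt contents, the bot name `ExampleAIBot`, the helper `is_allowed`, and the example.com URLs are all hypothetical and are not taken from TollBit's letter or from any publisher's site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a publisher might serve. The bot name is invented
# for illustration and does not refer to any real crawler.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # The named bot is blocked from the whole site.
    print(is_allowed("ExampleAIBot", "https://example.com/news/story"))
    # Other crawlers are only kept out of /private/.
    print(is_allowed("OtherBot", "https://example.com/news/story"))
    print(is_allowed("OtherBot", "https://example.com/private/draft"))
```

The check runs entirely on the crawler's side. Nothing in the protocol stops a crawler from skipping it, which is why compliance is described as voluntary.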

More recently, robots.txt has become a key tool for publishers seeking to block tech companies from using their content without compensation in generative AI systems. AI companies use such content both to train their algorithms and to generate summaries of articles in real time.

Some publishers, such as the New York Times, have sued AI companies for copyright infringement, while others have opted for licensing agreements. Disputes often arise over the perceived value of the content. Many AI developers argue that accessing publicly available content for free breaks no laws.

Thomson Reuters, owner of Reuters News, is among those licensing news content for AI applications.

Publishers remain concerned, particularly since Google last year began showing AI-generated summaries in response to search queries. To keep their content out of these summaries, publishers must adopt measures that could also reduce their visibility in Google search results.
