Reddit Accuses Perplexity AI of Illegally Harvesting Billions of Posts for AI Training

Abdulafeez Olaitan

Reddit has filed a federal lawsuit against Perplexity AI, accusing the artificial intelligence firm and its data partners of carrying out what it describes as an “industrial-scale” operation to scrape massive amounts of user-generated content from the platform. The complaint, filed on Wednesday, also names SerpApi, Oxylabs, and AWM Proxy as co-defendants, alleging that the companies built and deployed tools to bypass Reddit’s and Google’s anti-scraping protections to harvest data for AI model training.

According to the filing, the defendants collectively extracted nearly three billion Google search result pages containing Reddit data. Reddit claims these scraping tools were specifically designed to evade security controls and systematically copy user content from the site’s public pages, thereby violating its Terms of Service. The lawsuit further alleges that Perplexity continued to use Reddit data even after receiving a cease-and-desist letter in May 2024, incorporating it into the company’s “answer engine” product to train and enhance its AI capabilities.

This marks the second major legal action Reddit has taken against an AI company in 2025. Earlier this year, it filed a similar lawsuit against Anthropic, accusing the firm of unlawfully scraping Reddit’s platform to train the Claude AI model. In that case, Reddit alleged that Anthropic accessed its servers more than 100,000 times even after Anthropic had publicly claimed to have stopped.

In response to the new lawsuit, Perplexity posted a statement on Reddit calling the case “a sad example of what happens when public data becomes a big part of a public company’s business model.” The company argued that Reddit’s interpretation of ownership over publicly visible content contradicts the principle of an open internet. “By their logic, if you refer to any public Reddit link, they just might sue you too,” a Perplexity representative remarked.

Other defendants also rejected Reddit’s accusations. A spokesperson for SerpApi said the company had received no formal communication from Reddit and would challenge the claims in court. Oxylabs’ governance and strategy chief, Denas Grybauskas, criticised Reddit’s move as an attempt to monopolise publicly accessible data, asserting that “no company should claim ownership of public data that does not belong to them.”

Legal experts say the case raises complex questions about ownership, consent, and fair use in the age of AI. Attorney Andrew Rossow explained that while users may grant Reddit a license to host or distribute their content, that does not automatically permit third-party AI firms to use it for training. Courts will likely need to determine whether Reddit’s Terms of Service explicitly cover such uses and whether the AI companies’ actions constitute unauthorised access or data misuse.

Rossow further noted that the issue touches on broader ethical questions about how AI firms source their training data. He argued that treating online communities as “free raw material” ignores the time and creativity of the people who generate that content. “The supposed knowledge behind large language models is the product of millions of users’ effort,” he said, adding that AI developers must learn to “respect digital citizenship and community norms.”

Abdulafeez Olaitan is a communication specialist with quality experience in digital media as a writer, journalist and editor. He has been nominated for the Rhysling Award, Pushcart Prize and Best of the Net Award. Contact: Abdulafeez.Olaitan [at] news.ng