The company behind the AI search engine Perplexity ignores requests from websites not to be scraped, according to Wired and researcher Robb Knight. The start-up denies doing this, but both the researcher and the publication conclude that it is the case.
A post by Knight shows that Perplexity can provide summaries of websites that PerplexityBot is asked not to visit via the robots.txt file. Knight was also able to record that the company used a masked bot, one that sends no user-agent string to identify itself, to scrape the protected website.
Wired confirms the claims based on its own research. The news outlet asked the AI search engine and chatbot to summarize pages that were protected with the robots.txt file, and Perplexity was still able to share information from those pages. Wired's parent company also logged similar visits from a bot at an IP address that 'almost certainly' belongs to Perplexity. The company behind the AI service tells Wired that the article shows a 'misunderstanding' of the technology, but does not elaborate on the allegations.
Perplexity.ai is an AI tool that collects information from the internet and presents it to users via a chatbot interface. The start-up behind the search engine says that, like other major AI companies, it honors requests in robots.txt files. In these so-called Robots Exclusion Protocol files, websites can indicate that they do not want visits from specific scrapers, also called web crawlers. Scrapers are used to automatically collect content from the internet; companies can use that content to train their models or, as in Perplexity's case, as source material. Tweakers publisher DPG Media also refuses scraping in its robots.txt file.
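To illustrate how such a robots.txt request works, here is a minimal sketch using Python's standard library robots.txt parser. The rules and the "PerplexityBot" user-agent name are taken from the reporting above; the example URL is hypothetical, and this is not Perplexity's own code.

```python
# Minimal sketch of the Robots Exclusion Protocol check a well-behaved
# crawler performs before fetching a page. The rules below are an
# illustrative example, not an actual site's robots.txt.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: PerplexityBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A crawler that respects robots.txt checks its own user agent first:
print(parser.can_fetch("PerplexityBot", "https://example.com/article"))  # False: disallowed
print(parser.can_fetch("SomeOtherBot", "https://example.com/article"))   # True: no rule applies
```

Because the check keys on the user-agent string, a bot that omits or masks that string, as Knight observed, sidesteps rules aimed specifically at it.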