OpenAI and Anthropic are ignoring the no-scrape requests that websites place in their robots.txt files, Business Insider claims. Wired previously reported that the company behind AI search engine Perplexity also ignores such requests.
According to Business Insider, OpenAI and Anthropic are ignoring requests from media publishers not to scrape the publishers' content for use as training data for machine learning models. Both companies have previously said they would honor no-scrape requests in robots.txt files.
Business Insider does not say how it obtained this information. The site does refer to an earlier article by Reuters, in which the news agency reported that several AI companies ignore robots.txt requests. Reuters based this on a study by TollBit, a start-up that brokers licensing deals between AI companies and publishers. However, that article did not name the AI companies said to be ignoring the robots.txt protocol.
On Wednesday, Wired wrote that AI search engine and chatbot Perplexity ignores requests from websites not to be scraped. The bot could provide summaries of web pages that PerplexityBot should not visit according to the sites' robots.txt files. Perplexity is therefore said to use the content of such sites as source material, while Business Insider claims that OpenAI and Anthropic still train their chatbots on content from websites that have indicated they do not want this.
Since last year, websites have been able to indicate that they do not want their content scraped without permission. This is done by adding rules to robots.txt, the text file that is part of web standards and provides instructions to non-human visitors. Tweakers publisher DPG Media, among others, prohibits the use of web crawlers in its robots.txt file. However, crawlers are not obliged to follow these instructions.
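As a rough illustration of how such a robots.txt request works, the sketch below parses an example file with Python's standard urllib.robotparser module and checks which crawlers it turns away. The file contents, the example.com URL and the is_allowed helper are assumptions made for this illustration. PerplexityBot is the crawler name mentioned above, while GPTBot and ClaudeBot are assumed here to be the user-agent names of the OpenAI and Anthropic crawlers. The check only reports what the file asks for; nothing in it stops a crawler from fetching the page anyway.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration: it asks three AI crawlers to
# stay away and leaves the site open to all other visitors.
EXAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/news/article.html"  # placeholder URL
    for agent in ("GPTBot", "ClaudeBot", "PerplexityBot", "SomeOtherBot"):
        verdict = "allowed" if is_allowed(EXAMPLE_ROBOTS_TXT, agent, page) else "disallowed"
        print(f"{agent}: {verdict}")
```

A crawler that respects the file would run a check like this before each request and skip disallowed pages; the reports above concern crawlers that allegedly fetch the pages regardless.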