How to block OpenAI from crawling your website
Want to stop the creators of ChatGPT from using your content to train their AI models? Here's how to block their crawler by adding just two lines to a text file.
Aug 15, 2023 • 4 Minute Read
- AI & Machine Learning
Not everyone was thrilled to learn that OpenAI, the creators of ChatGPT, had been training their AI on data taken from people’s websites without permission. While it’s too late to do anything about the data they’ve already crawled, you can stop these models from being trained on your current and future content — and all it takes is two lines of code.
However, just because you can block OpenAI from crawling your website doesn't mean you should, and I would highly recommend asking yourself that question first. For more on that, read this article: “Leaders: Don't prematurely block OpenAI from your websites.”
How ChatGPT crawls the web for content
OpenAI uses a web crawler called GPTBot to collect content for training their AI models (such as GPT-4). Web crawling is when an automated bot visits pages across the internet and collects data on their content. It happens all the time, and in fact, this is how Google finds the pages it shows in search results!
How to block GPTBot from crawling your site
The rules covered below tell GPTBot not to access your site, which stops it from using your content for training purposes.
First, open your website’s robots.txt file
If you’re not familiar with this concept, a robots.txt file lives at the root of your website. So, for www.pluralsight.com, it would live at www.pluralsight.com/robots.txt. This is a document that tells web crawlers which parts of your website they may crawl (well-behaved crawlers honor it), and it is always publicly accessible. For instance, if you wanted to stop Google from crawling your entire site, you’d enter:
User-agent: Googlebot
Disallow: /

User-agent: *
Allow: /
The first two lines block the user agent called Googlebot from crawling your website. The remaining two lines allow any other bot to crawl your website. If you wanted to block only a certain part of your website, you might put the following in:
User-agent: Googlebot
Disallow: /nogooglebot/
This would block Googlebot from crawling any URL that starts with https://pluralsight.com/nogooglebot/.
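One way to sanity-check rules like these before uploading them is Python's standard-library `urllib.robotparser`. The sketch below is only a minimal check under assumptions: the helper name `is_allowed` and the example page paths are made up for illustration, and the standard-library parser may not match every crawler's own matching logic in edge cases.

```python
from urllib.robotparser import RobotFileParser

# The partial-block rules from the example above.
RULES = """\
User-agent: Googlebot
Disallow: /nogooglebot/
"""


def is_allowed(rules: str, user_agent: str, url: str) -> bool:
    """Return True if `rules` (robots.txt text) let `user_agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    base = "https://pluralsight.com"
    # Hypothetical pages, used only to exercise the rules.
    print(is_allowed(RULES, "Googlebot", base + "/nogooglebot/page.html"))
    print(is_allowed(RULES, "Googlebot", base + "/courses/"))
    print(is_allowed(RULES, "Bingbot", base + "/nogooglebot/page.html"))
```

Under these rules, only the first check should come back `False`: Googlebot is kept out of /nogooglebot/ and nothing else is restricted.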
To set up a robots.txt file:
Create a file called robots.txt (a site can have only one of these files)
Add rules like the ones above to your file
Upload it to the root of your site
Stopping GPTBot from accessing your whole website
Now that we’ve explained what a robots.txt file is, let’s block GPTBot. Add these lines to your site’s robots.txt:
User-agent: GPTBot
Disallow: /
Yes, it really is that simple.
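If your site already has a robots.txt, the two lines can simply be appended to it. The sketch below shows one possible way to do that without adding the block twice, using Python's standard-library `urllib.robotparser` to check the result. The function names are hypothetical, and it assumes the existing file has no GPTBot group of its own.

```python
from urllib.robotparser import RobotFileParser

# The two lines from the article.
GPTBOT_BLOCK = "User-agent: GPTBot\nDisallow: /\n"


def blocks_gptbot(robots_txt: str) -> bool:
    """Return True if the rules forbid GPTBot from fetching the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch("GPTBot", "/")


def add_gptbot_block(robots_txt: str) -> str:
    """Append the GPTBot block unless GPTBot is already blocked at the root."""
    if blocks_gptbot(robots_txt):
        return robots_txt
    existing = robots_txt.rstrip("\n")
    # Separate the new group from earlier rules with a blank line.
    separator = "\n\n" if existing else ""
    return existing + separator + GPTBOT_BLOCK


if __name__ == "__main__":
    current = "User-agent: *\nAllow: /\n"  # example starting file
    print(add_gptbot_block(current))
```

Because the helper checks first, running it a second time on its own output leaves the text unchanged.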
Stopping GPTBot from accessing specific parts of your website
If you want to let GPTBot access certain parts of your website and not others, you’d enter something like the following:
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
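To see what these per-directory rules do in practice, you can evaluate a few paths against them. This is a rough sketch using Python's standard-library `urllib.robotparser`; the helper `gptbot_access` and the file names under each directory are hypothetical, and /directory-3/ is included only to show what happens to a path that no rule mentions.

```python
from urllib.robotparser import RobotFileParser

# The per-directory rules from the example above.
RULES = """\
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
"""


def gptbot_access(rules: str, paths: list[str]) -> dict[str, bool]:
    """Map each path to whether the rules let GPTBot fetch it."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return {path: parser.can_fetch("GPTBot", path) for path in paths}


if __name__ == "__main__":
    sample_paths = [
        "/directory-1/a.html",
        "/directory-2/b.html",
        "/directory-3/c.html",  # not mentioned by any rule
    ]
    for path, allowed in gptbot_access(RULES, sample_paths).items():
        print(f"{path}: {'allowed' if allowed else 'blocked'}")
```

Only /directory-2/ is blocked here. A path that no rule mentions stays crawlable, because robots.txt allows access by default.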
When ChatGPT may crawl your website, regardless of your robots.txt file
Currently, it’s unclear whether web-browsing versions of ChatGPT (such as “Browse with Bing”) or ChatGPT plugins will be stopped by your robots.txt file. That’s because their requests don’t necessarily go through GPTBot.
What GPTBot won’t crawl, regardless of your robots.txt file
According to OpenAI, web pages crawled by the bot are filtered to “remove sources that require paywall access, are known to gather personally identifiable information (PII), or have text that violates (their) policies.”
That said, it’s a gamble to rely on GPTBot to not crawl these things, so the safest bet would be to use the above methods (and maybe don’t have PII publicly searchable in the first place).
How can I tell if my website has already been crawled to train an AI?
OpenAI has been notoriously tight-lipped about what sites GPT-4, the current AI model behind ChatGPT, was trained on. For competitive reasons, OpenAI has said they will not share the details of the “architecture (including model size), hardware, training compute, dataset construction, training method, or similar.”
In short, there’s no way to tell if your website was crawled to train GPT-4, so all you can do is take the precautions listed above if you don’t want your website data crawled to train an AI model (or at least, the ones built by OpenAI).
After reading this article, you should have a solid understanding of how the robots.txt file works and how to add an entry that blocks OpenAI’s bot from crawling your site to train AI models.
Further learning about ChatGPT and AI
Worried about ChatGPT? Being informed is the best way to make measured decisions on how to handle AI use at your organization. Pluralsight offers a number of courses that can help you learn the ins and outs of AI, and you can sign up for a 10-day free trial with no commitments.