Hey there, website owners! Did you know that search engines and other online services often use crawlers to check out what’s on your site? One of these is OpenAI, which uses crawlers to collect data for training its artificial intelligence (AI) models.
If you’d rather not be part of this, we’ve got good news. You can stop them in their tracks with a simple tweak to a file on your site called robots.txt. Keep reading; a step-by-step guide is up next. 👀
OpenAI has recently announced a feature that allows website operators to block its GPTBot web crawler from scraping content to help train its language models, like GPT-3 or GPT-4. This means you can now explicitly disallow OpenAI’s crawlers in your site’s robots.txt file.
According to OpenAI, crawled web pages may be used to improve future models, although the company says it filters out paywalled content and sources known to gather personally identifiable information (PII).
However, opting out could be a significant step towards user privacy and data protection.
AI training isn’t necessarily a bad thing, but if you’re concerned about the ethical and legal implications of AI training data sourcing, the ability to block OpenAI’s web crawlers is a crucial first step. It won’t remove any content previously scraped, but it’s a starting point in a landscape increasingly concerned with data privacy and consent.
Now that you’re up to speed on the context, let’s dive into the steps you can take to protect your site’s content.
What Is a Robots.txt File?
Before we dive in, let’s quickly understand what a robots.txt file is. Think of it as the bouncer at the door of your website: it tells crawlers which pages they can visit and which ones they can’t. This file sits in the root directory (the main folder) of your site, so crawlers can find it right away.
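For example, a minimal robots.txt file might look like this (the /private/ path below is just a hypothetical placeholder):
User-agent: *
Disallow: /private/
This tells every crawler (the * wildcard) to stay out of the /private/ directory while leaving the rest of the site open.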
Ready? Here’s how to block GPTBot, step by step:
1. Locate your robots.txt file: This file is usually in the root directory of your website. If you can’t find it, you might need to create one.
2. Open the robots.txt file with a text editor. If you’re creating a new one, you can use any plain text editor like Notepad on Windows or TextEdit on a Mac.
3. Add the following lines to your robots.txt file (this will tell the OpenAI crawler not to crawl any pages on your website):
User-agent: GPTBot
Disallow: /
4. Upload the robots.txt file back to your root directory.
5. Check that the change took effect. Crawlers will pick up the new rules the next time they fetch your robots.txt file. To prompt Google to re-read it, you can open the robots.txt Tester in Google Search Console at the following URL (you can also verify the rule yourself; see the quick check below):
https://www.google.com/webmasters/tools/robots?siteUrl=https://yourwebsite.com
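For that quick check, here’s a minimal sketch using Python’s standard-library urllib.robotparser module. It fetches your live robots.txt file and asks whether GPTBot may fetch a page (https://yourwebsite.com is a placeholder; substitute your own domain):
from urllib.robotparser import RobotFileParser

# Point the parser at your site's live robots.txt file.
# "yourwebsite.com" is a placeholder; use your own domain.
rp = RobotFileParser()
rp.set_url("https://yourwebsite.com/robots.txt")
rp.read()  # download and parse the file

# With "User-agent: GPTBot" / "Disallow: /" in place, this prints False.
print(rp.can_fetch("GPTBot", "https://yourwebsite.com/any-page"))

# Crawlers with no matching rule are allowed by default; this prints True.
print(rp.can_fetch("Googlebot", "https://yourwebsite.com/any-page"))
If the first call still prints True, double-check that the updated file is actually live at /robots.txt on your domain.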
You can also use the Allow directive in your robots.txt file to let the OpenAI crawler access specific parts of your website while blocking others:
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
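If you use this mixed setup, the same urllib.robotparser sketch from above can confirm the split (the directory names here are placeholders): can_fetch("GPTBot", ...) should return True for pages under /directory-1/ and False for pages under /directory-2/.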
Once you’re done editing, remember to save your changes and re-upload the robots.txt file.
You might wonder why you should bother doing this. Well, by updating your robots.txt file, you take control. You decide who can look at your site’s content and who can’t. This can be especially important if you have sensitive information on your site that you don’t want to become part of AI training data.
It’s your website, and the choice of who gets to crawl it should be yours. By spending just a few minutes on your robots.txt file, you can prevent OpenAI’s crawlers from exploring your content. It’s a simple yet effective step to protect your site.
And there you have it! Now you know how to make sure OpenAI doesn’t use your site for its AI training. Simple, wasn’t it?