Hey there, website owners! Did you know that search engines and other online services often use AI crawlers to check out what’s on your site? These crawlers, deployed by giants like OpenAI and Google, collect data to train their evolving artificial intelligence (AI) models.
If you wish to exercise greater control over who gets to see and use your content, read on. We’ll guide you on how to adjust your site’s robots.txt file to fend off these AI web crawlers. Keep reading; a step-by-step guide is up next. 👀

AI training isn’t necessarily a bad thing, but if you’re concerned about the ethical and legal implications of AI training data sourcing, the ability to block OpenAI’s and Google Bard’s web crawlers is a crucial first step. It won’t remove any content previously scraped, but it’s a starting point in a landscape increasingly concerned with data privacy and consent.
💡 Before we dive in, let’s quickly understand what a robots.txt file is. Think of it as the bouncer at the door of your website. It tells crawlers which pages they can visit and which ones they can’t. This file sits in the main folder of your site, so crawlers can find it right away.
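If you’re curious how this “bouncer” works in practice, here’s a minimal sketch using Python’s built-in urllib.robotparser module. The sample rules and the /private/ path are purely illustrative:

```python
from urllib.robotparser import RobotFileParser

# An illustrative robots.txt: all crawlers may visit everything
# except pages under /private/.
sample_rules = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(sample_rules.splitlines())

# A well-behaved crawler asks before fetching each page.
print(parser.can_fetch("AnyBot", "https://example.com/blog/post"))     # True
print(parser.can_fetch("AnyBot", "https://example.com/private/data"))  # False
```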
OpenAI Crawlers
Start Here: What OpenAI’s Update Means for Your Website
OpenAI has recently announced a feature that allows website operators to block its GPTBot web crawler from scraping content to help train its language models, like GPT-3 or GPT-4. This means you can now explicitly disallow OpenAI’s crawlers in your site’s robots.txt file.
🔊 What OpenAI Says
According to OpenAI, crawled web pages may contribute to future models, although the company filters out sources behind paywalls and sources known to gather personally identifiable information (PII).
Even with those filters, opting out can be a significant step towards user privacy and data protection.
📌 How to Block OpenAI’s Crawler
- Find Your robots.txt File: This file is usually in the root directory of your website. If you can’t find it, you might need to create one.
- Edit the File: Open the robots.txt file with a text editor. If you’re creating a new one, you can use any plain text editor, like Notepad on Windows or TextEdit on a Mac.
- Add the Rules: Add the following lines to your robots.txt file. They tell the OpenAI crawler not to crawl any pages on your website:

```
User-agent: GPTBot
Disallow: /
```

- Save and Upload: Save your changes and upload the robots.txt file back to your root directory.
- Refresh Google’s robots.txt cache: Googlebot won’t pick up changes to your robots.txt file right away; it re-fetches the file periodically, which can take up to a day. To prompt a re-check sooner, you can use the robots.txt tool in Google Search Console: https://www.google.com/webmasters/tools/robots?siteUrl=https://yourwebsite.com
- ✅ Once you have completed these steps, GPTBot will stop crawling your website for AI training (OpenAI states that GPTBot respects robots.txt rules).
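Before relying on the new rule, you can sanity-check your live file with the same standard-library parser — a quick sketch, assuming your site is reachable at https://yourwebsite.com (swap in your real domain):

```python
from urllib.robotparser import RobotFileParser

# Download and parse the live robots.txt, then ask whether
# GPTBot may crawl the homepage.
parser = RobotFileParser()
parser.set_url("https://yourwebsite.com/robots.txt")
parser.read()

if parser.can_fetch("GPTBot", "https://yourwebsite.com/"):
    print("GPTBot is still allowed - check your robots.txt rules.")
else:
    print("GPTBot is blocked, as intended.")
```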
Here are some additional things to keep in mind:
- You can also use the Allow directive in your robots.txt file to let the OpenAI crawler access specific pages on your website (a quick way to test rules like these is sketched after this list):

```
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
```

- If you have a large website, you may want to consider using a web crawler management tool to help you manage your robots.txt file.
- You can also use other methods to keep your website out of AI training data, such as password protection or noindex tags.
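As promised above, here’s a quick way to test mixed Allow/Disallow rules before deploying them, again using Python’s urllib.robotparser. The directory names mirror the example above; individual crawlers may interpret edge cases slightly differently, so treat this as a local sanity check rather than a guarantee:

```python
from urllib.robotparser import RobotFileParser

# The mixed rules from the example above, parsed from a string.
mixed_rules = """\
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
"""

parser = RobotFileParser()
parser.parse(mixed_rules.splitlines())

for path in ("/directory-1/page", "/directory-2/page", "/directory-3/page"):
    allowed = parser.can_fetch("GPTBot", "https://yourwebsite.com" + path)
    print(f"{path}: {'allowed' if allowed else 'blocked'}")
# Expected: /directory-1/ is allowed, /directory-2/ is blocked, and paths
# matched by no rule (like /directory-3/) are allowed by default.
```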
Google Bard Crawlers
The Emergence of Google Bard
As AI evolves, Google Bard has its own crawler that visits websites to gather model-training data. Like OpenAI, Google recognizes the importance of user privacy and offers webmasters the choice to block it.
🔊 What Google Bard Says
Google highlights the benefits of AI in improving their products and acknowledges the feedback from web publishers seeking more control. They introduced “Google-Extended,” a new tool for publishers to manage how their sites affect Bard and Vertex AI generative APIs. They emphasize transparency, control, and their commitment to engaging with the community for better AI applications.
📌 How to Block Google Bard’s Crawler
- Pinpoint Your robots.txt File: As before, it’s usually in the site’s root directory.
- Access and Edit: Use a text editor to make your changes.
- Add the Rules: To block Google Bard, add the following lines to your robots.txt file. They tell the Google Bard crawler not to crawl any pages on your website:

```
User-agent: Google-Extended
Disallow: /
```
- Commit and Update: Save your modifications and replace the file in the root directory.
- Alert Google: As previously noted, remind Googlebot of the changes via the Search Console.
- ✅ Your website is now set to block Google Bard’s crawler.
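If you blocked both crawlers, you can extend the earlier check to confirm each rule in one pass — again assuming https://yourwebsite.com stands in for your real domain:

```python
from urllib.robotparser import RobotFileParser

# Re-read the live robots.txt and check both AI crawlers at once.
parser = RobotFileParser()
parser.set_url("https://yourwebsite.com/robots.txt")
parser.read()

for agent in ("GPTBot", "Google-Extended"):
    allowed = parser.can_fetch(agent, "https://yourwebsite.com/")
    print(f"{agent}: {'still allowed' if allowed else 'blocked'}")
```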
Why Should You Do This?
You might wonder why you should bother doing this. Well, by updating your robots.txt file, you take control. You decide who can look at your site’s content and who can’t. This can be especially important if you have sensitive information on your site that you don’t want to become part of AI training data.
Final Thoughts
It’s your website, and the choice of who gets to crawl it should be yours. By spending just a few minutes on your robots.txt file, you can take control and prevent OpenAI’s and Google’s crawlers from exploring your content. It’s a simple yet effective step to protect your site.