Updated robots.txt Based on Content Signals to Open All Content for AI Training
I run this site, codenote.net, as a place for sharing knowledge and expertise related to software engineering. I have updated the site policy to make all of its content available for any use, including search engines, AI input, and AI training.
The reason is simple: I want codenote.net content to be available as context for AI. And if the content is used as training data for AI, I believe AI will return better answers in the future, which also benefits me.
For this policy update, I focused on an initiative called “Content Signals” led primarily by Cloudflare.
Content Signals proposes a standard way for website operators to express, through robots.txt, which uses of their content they consent to by AI crawlers and other automated clients. The aim is to build a transparent and healthy ecosystem between AI developers and content creators.
Following Content Signals’ “Allow Search, AI Input & AI Training” policy, I updated this site’s robots.txt as follows:
# As a condition of accessing this website, you agree to
# abide by the following content signals:
# (a) If a content-signal = yes, you may collect content
# for the corresponding use.
# (b) If a content-signal = no, you may not collect content
# for the corresponding use.
# (c) If the website operator does not include a content
# signal for a corresponding use, the website operator
# neither grants nor restricts permission via content signal
# with respect to the corresponding use.
# The content signals and their meanings are:
# search: building a search index and providing search
# results (e.g., returning hyperlinks and short excerpts
# from your website's contents). Search does not include
# providing AI-generated search summaries.
# ai-input: inputting content into one or more AI models
# (e.g., retrieval augmented generation, grounding, or other
# real-time taking of content for generative AI search
# answers).
# ai-train: training or fine-tuning AI models.
# ANY RESTRICTIONS EXPRESSED VIA CONTENT SIGNALS ARE EXPRESS
# RESERVATIONS OF RIGHTS UNDER ARTICLE 4 OF THE EUROPEAN
# UNION DIRECTIVE 2019/790 ON COPYRIGHT AND RELATED RIGHTS
# IN THE DIGITAL SINGLE MARKET.
User-Agent: *
Content-Signal: ai-train=yes, search=yes, ai-input=yes
Allow: /

The key points of this configuration are:

- User-agent: * and Allow: / permit all crawlers to crawl the entire site, maintaining the basic openness of the web.
- Content-Signal specifies ai-train=yes, search=yes, and ai-input=yes, explicitly allowing search index building, AI input, and AI training (a small sketch of how a crawler might read this line appears below).

First and foremost, I want to use my content as context for AI. If my articles are referenced in RAG (Retrieval Augmented Generation) or AI search, I can get more accurate and contextually appropriate answers.
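As an aside, here is a minimal sketch of how a crawler might read this signal. This is only my own illustration, not part of the Content Signals specification or any particular crawler’s implementation: it fetches robots.txt and parses the Content-Signal line by hand, since Python’s standard urllib.robotparser does not understand this directive.

import urllib.request

def fetch_content_signals(base_url: str) -> dict:
    """Fetch robots.txt and parse any Content-Signal line into a dict of booleans."""
    with urllib.request.urlopen(base_url.rstrip("/") + "/robots.txt") as resp:
        robots_txt = resp.read().decode("utf-8", errors="replace")

    signals = {}
    for raw_line in robots_txt.splitlines():
        line = raw_line.strip()
        # The comment block above the directives starts with "#", so only the
        # actual directive line matches here, e.g.
        # "Content-Signal: ai-train=yes, search=yes, ai-input=yes"
        if line.lower().startswith("content-signal:"):
            for pair in line.split(":", 1)[1].split(","):
                key, sep, value = pair.strip().partition("=")
                if sep:
                    signals[key.strip().lower()] = value.strip().lower() == "yes"
    return signals

if __name__ == "__main__":
    # With the robots.txt shown above, this should print:
    # {'ai-train': True, 'search': True, 'ai-input': True}
    print(fetch_content_signals("https://codenote.net"))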
Additionally, having content used for AI training is advantageous for me. If articles from this site become training data for AI, that will lead to more accurate language models in the future and improve productivity for developers, including myself. There is also the practical benefit that conversations with AI go more smoothly when information I have written is reflected in the model.
Of course, unauthorized reproduction of content and copyright issues are important topics, but I concluded that, for me, the benefits of being open outweigh the benefits of restricting how the information is used.
This robots.txt update is a statement of this site’s stance in an era of coexisting with AI. By allowing AI to utilize my content, I can also benefit from it—that’s the virtuous cycle I’m hoping for.
I would be happy if codenote.net content is utilized as context for AI and as training data, ultimately helping many engineers including myself.
That’s all from the Gemba on updating robots.txt based on Content Signals.