Safeguarding your website content vs. tapping into AI benefits

Is your website content being used by AI companies?

Over 1/4 of Australia's top companies allow their content to be used by AI

Generative AI is growing in popularity, so it's essential for marketers to fully understand how AI is using their intellectual property. 


AI companies continue to train their large language models (LLMs) on content that their web crawlers collect by scanning websites, and that content can sometimes be misinterpreted.


We crawled the websites of 192 of Australia's ASX 200 companies and found some interesting insights. For example, less than 1% of the companies have explicitly blocked AI access.

Is your website blocking AI access?

What is AI Aware?

The AI Aware tool shows you whether your website has any rules in place to block AI access to specific URLs on your website. 


Simply enter your details in the form, and find out.

AI Aware research

We analysed the websites of 192 of Australia's ASX 200 companies to see whether they block AI access, including access by ChatGPT. Here are the results:

Less than 1% have explicitly blocked AI access

25.5% have allowed full AI access

58.4% have allowed partial AI access

15.6% have specified no rules for AI access
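
To make the four categories above concrete, here is a minimal sketch of how a robots.txt file could be sorted into them. It is an illustration only, not the method behind the AI Aware research: the `rules_for` and `classify` helpers, the category definitions, and the choice to fall back to the wildcard (`*`) group are all assumptions made for this example.

```python
from typing import List, Optional, Tuple

Rule = Tuple[str, str]  # (directive, path), e.g. ("disallow", "/admin/")


def rules_for(robots_txt: str, agent: str) -> Optional[List[Rule]]:
    """Return the rules of the group naming `agent`, else the `*` group, else None."""
    groups = {}
    current_agents = []
    in_rules = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                current_agents = []
                in_rules = False
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            in_rules = True
            for name in current_agents:
                groups[name].append((field, value))
    return groups.get(agent.lower(), groups.get("*"))


def classify(robots_txt: str, agent: str = "GPTBot") -> str:
    """Assumed categories: 'no rules', 'full access', 'blocked', 'partial access'."""
    rules = rules_for(robots_txt, agent)
    if rules is None:
        return "no rules"
    disallowed = [path for kind, path in rules if kind == "disallow" and path]
    allowed = [path for kind, path in rules if kind == "allow" and path]
    if not disallowed:
        return "full access"
    if "/" in disallowed and not allowed:
        return "blocked"
    return "partial access"


if __name__ == "__main__":
    sample = "User-agent: *\nDisallow: /admin/\n"  # hypothetical robots.txt
    print(classify(sample))
```

A real audit would also need to fetch each robots.txt file, handle missing files and redirects, and decide which AI crawlers to check. Those steps are left out here.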

FAQs

Websites often use a robots.txt file to specify which parts of their site should not be crawled by web crawlers, including AI-based ones. Additionally, they may implement rate limiting, CAPTCHA challenges, or IP blocking to deter automated access. Read our blog to learn more about How to Block AI Content Scraping.
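
As a rough illustration of the robots.txt approach, the sketch below uses Python's standard `urllib.robotparser` module to check a sample file that blocks GPTBot while leaving other crawlers unrestricted. The robots.txt content, the `can_fetch` helper, the `SomeOtherBot` name, and the example.com URL are hypothetical and chosen only for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block OpenAI's GPTBot everywhere,
# and leave every other crawler unrestricted.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def can_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether `user_agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


url = "https://example.com/blog/post"  # placeholder URL
gptbot_allowed = can_fetch(ROBOTS_TXT, "GPTBot", url)
other_allowed = can_fetch(ROBOTS_TXT, "SomeOtherBot", url)

print("GPTBot allowed:", gptbot_allowed)
print("Other crawler allowed:", other_allowed)
```

Keep in mind that robots.txt is a request rather than an enforcement mechanism, which is why the measures mentioned above (rate limiting, CAPTCHA challenges, IP blocking) are often used alongside it.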
Generally, it is legal for website owners to control access to their content using methods like robots.txt or access restrictions. However, the legality may vary, and websites must comply with applicable laws, such as data protection regulations.
Some AI bots may attempt to bypass blocking measures, but doing so may violate a website's terms of service and potentially lead to legal consequences. Ethical AI should respect a website's access policies.
OpenAI may use web pages crawled by GPTBot to improve its future AI models. These pages are filtered to remove personally identifiable information (PII) and text that violates OpenAI's policies. Allowing GPTBot access to your site can contribute to enhancing the accuracy, capabilities, and safety of AI models.

Blocking GPTBot is a step towards enabling internet users to opt out of having their data used for training large language models. However, it does not retroactively remove previously scraped content from ChatGPT's training data. 

How to Opt Out of ChatGPT Scraped Data

The AI models behind chatbots, such as GPT-3, are trained on a combination of datasets, some of which are publicly available. GPT-3 was trained on five datasets with varying weights:
  • Common Crawl (60% weight in training): This dataset comprises vast amounts of web data collected since 2008, similar to how Google's search algorithm crawls web content.
  • WebText2 (22% weight in training): WebText2 is a dataset created by OpenAI, consisting of approximately 45 million web pages linked to from Reddit posts with at least three upvotes.
  • Books1 (8% weight in training): Books1 is one of two internet-based book collections used in the training process, contributing to the model's knowledge base.
  • Books2 (8% weight in training): Books2 is the second internet-based book collection, and it played a similar role in training GPT-3.
  • Wikipedia (3% weight in training): Wikipedia data was included in the training process, though with a lower weight compared to the other datasets.
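
The weights listed above describe how often each dataset is drawn from during training, not how large it is. The short sketch below shows that arithmetic under simple assumptions: the `WEIGHTS` dictionary copies the figures from the list, and the sampling step is an illustration of weighted selection, not OpenAI's actual training pipeline.

```python
import random

# Training-mix weights as listed above, in percent.
WEIGHTS = {
    "Common Crawl": 60,
    "WebText2": 22,
    "Books1": 8,
    "Books2": 8,
    "Wikipedia": 3,
}

# The listed figures add up to 101, presumably because of rounding,
# so normalise them before treating them as probabilities.
total = sum(WEIGHTS.values())
shares = {name: weight / total for name, weight in WEIGHTS.items()}

# Illustrative only: draw 1,000 training examples according to the weights.
rng = random.Random(0)  # fixed seed so the run is repeatable
sample = rng.choices(list(WEIGHTS), weights=list(WEIGHTS.values()), k=1000)
counts = {name: sample.count(name) for name in WEIGHTS}

for name in WEIGHTS:
    print(f"{name}: share {shares[name]:.1%}, drawn {counts[name]} times")
```

In a draw like this, roughly six in ten examples come from Common Crawl, which is why content on the open web matters so much for these models.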
