Understanding Applebot: Apple’s Web Crawler

How to stop Apple Intelligence scraping your data

Mark Craddock
Prompt Engineering


Applebot is the web crawler developed by Apple to power search and indexing features across its ecosystem, including Spotlight, Siri, and Safari. If you allow Applebot to access your website through robots.txt, your content can appear in search results for Apple users worldwide, extending your site’s visibility and reach.

What Does Applebot Access?

Applebot can crawl a wide range of resources from web servers, including:

  • robots.txt
  • sitemaps
  • RSS feeds
  • HTML documents
  • Sub-resources needed to render pages, such as JavaScript, Ajax requests, and images

Identifying Applebot

Applebot can be identified through reverse DNS lookups in the *.applebot.apple.com domain. Its IP addresses can also be matched against the CIDR prefixes in a JSON file that Apple publishes. Here’s an example of using the host command to verify Applebot’s identity:

$ host 17.58.101.179
179.101.58.17.in-addr.arpa domain name pointer 17-58-101-179.applebot.apple.com

$ host 17-58-101-179.applebot.apple.com
17-58-101-179.applebot.apple.com has address 17.58.101.179
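
The same two-step check can be scripted. Below is a minimal Python sketch using only the standard library (the function name is illustrative, and the result depends on live DNS at the time you run it):

import socket

def verify_applebot(ip: str) -> bool:
    """Forward-confirmed reverse DNS check for Applebot."""
    try:
        # Step 1: reverse lookup; genuine Applebot hosts resolve
        # under *.applebot.apple.com.
        hostname, _aliases, _addrs = socket.gethostbyaddr(ip)
        if not hostname.endswith(".applebot.apple.com"):
            return False
        # Step 2: forward-resolve that hostname and confirm it maps
        # back to the original IP, so a spoofed PTR record alone
        # is not enough.
        return ip in socket.gethostbyname_ex(hostname)[2]
    except (socket.herror, socket.gaierror):
        # No PTR record, or the forward lookup failed.
        return False

print(verify_applebot("17.58.101.179"))  # True for the example above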

User Agents

User agents help webmasters identify and manage crawler traffic. Applebot uses different user agents for search and podcast indexing.

The user-agent string contains “Applebot” along with other information:

  • Search:
      • For desktop:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15 (Applebot/0.1; +http://www.apple.com/go/applebot)
      • For mobile:
Mozilla/5.0 (iPhone; CPU iPhone OS 17_4_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4.1 Mobile/15E148 Safari/604.1 (Applebot/0.1; +http://www.apple.com/go/applebot)
  • Podcasts:
      • iTMS traffic is identified by:
User-Agent: iTMS

The iTMS user agent does not follow robots.txt rules, as it only crawls URLs associated with registered Apple Podcasts content.
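
If your server code needs to branch on crawler traffic, a simple token check against the strings above is enough as a first pass. Here is a sketch (the names are illustrative); since the header is client-supplied and trivially spoofed, pair it with the reverse DNS verification shown earlier before trusting a match:

def is_applebot(user_agent: str) -> bool:
    """True if the User-Agent header carries the "Applebot" token.

    The header is trivially spoofed, so treat a match as a hint and
    confirm it with the forward-confirmed reverse DNS check above.
    """
    return "applebot" in user_agent.lower()

desktop_ua = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.4 Safari/605.1.15 "
    "(Applebot/0.1; +http://www.apple.com/go/applebot)"
)
print(is_applebot(desktop_ua))  # True
print(is_applebot("iTMS"))      # False: podcast traffic has no Applebot token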

Customising robots.txt Rules

Applebot adheres to standard robots.txt directives. Here’s an example of a robots.txt file that restricts Applebot’s access to certain directories:

User-agent: Applebot
Allow: /
Disallow: /private/

User-agent: *
Disallow: /not-allowed/

If Applebot-specific instructions are not provided, it will follow the directives for Googlebot.
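
You can sanity-check rules like these before deploying them with Python’s standard-library parser. Two caveats: urllib.robotparser applies rules in file order (first match wins) rather than by longest match, so the blanket Allow line is omitted below (allowing everything is already the default), and it does not reproduce Applebot’s fallback to Googlebot directives:

from urllib.robotparser import RobotFileParser

rules = """\
User-agent: Applebot
Disallow: /private/

User-agent: *
Disallow: /not-allowed/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("Applebot", "/index.html"))       # True
print(parser.can_fetch("Applebot", "/private/notes"))    # False
print(parser.can_fetch("SomeOtherBot", "/not-allowed/")) # False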

Rendering and Robot Rules

For Applebot to index your website effectively, ensure all resources needed to render the page are accessible. Blocking resources like JavaScript and CSS might prevent proper rendering. Ensure your site performs well even if some resources are unavailable, a practice known as graceful degradation.
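
For example, if a disallowed directory also holds render-critical assets, explicit Allow rules can carve them out. The paths below are illustrative, and this assumes Applebot resolves Allow/Disallow conflicts in favour of the more specific rule, as modern crawlers generally do:

User-agent: Applebot
Disallow: /app/
Allow: /app/static/css/
Allow: /app/static/js/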

Customising Indexing Rules for Applebot

Applebot supports robots meta tags in HTML documents to control indexing. These meta tags should be placed in the <head> section:

<html><head>
<meta name="robots" content="noindex, nosnippet"/>
...
</head>
<body>...</body>
</html>

Available directives include:

  • noindex: Do not index this page.
  • nosnippet: Do not generate a snippet for this page.
  • nofollow: Do not follow any links on this page.
  • none: Do not index, snippet, or follow links.
  • all: Index, snippet, and follow links as usual.
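
To check which directives a crawler would read from a given page, here is a minimal standard-library sketch (the class name is illustrative) that extracts the robots meta content from the HTML:

from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect the directives from <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            content = attrs.get("content", "")
            self.directives += [d.strip() for d in content.split(",")]

html = ('<html><head><meta name="robots" content="noindex, nosnippet"/>'
        '</head><body>...</body></html>')
p = RobotsMetaParser()
p.feed(html)
print(p.directives)  # ['noindex', 'nosnippet']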

Controlling Data Usage by Apple’s AI Models

Apple offers an additional user-agent token, Applebot-Extended, which gives web publishers control over whether their content is used to train Apple’s AI models. To exclude a specific directory from training data, add a rule like the following to robots.txt:

User-agent: Applebot-Extended
Disallow: /private/

Applebot-Extended does not crawl pages itself; it only determines whether content already crawled by Applebot may be used to train Apple’s AI models.
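
Because Applebot-Extended follows standard robots.txt semantics, disallowing the root path opts your entire site out of AI training while leaving ordinary Applebot indexing untouched:

User-agent: Applebot-Extended
Disallow: /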

Search Rankings

Apple Search ranks web results based on:

  • User engagement
  • Relevancy to search terms
  • Quality and number of links
  • User location signals
  • Webpage design

These factors influence search results collectively; no single factor has a predetermined weight.

Conclusion

Understanding and configuring Applebot correctly can significantly enhance your website’s visibility within Apple’s ecosystem. Properly managing robots.txt and meta tags ensures that your content is indexed efficiently and appropriately by Applebot.
