Crawlability refers to the ability of search engine bots (or crawlers) to access and navigate the content of a website. If a page is crawlable, it means that search engines can find it and read its content.
Example: Imagine you have a website with several pages, but you accidentally add Disallow rules for some of them in your robots.txt file. Search engine bots won't be able to crawl those pages, so their content can't be read and they typically won't appear in search results. (Note that "noindex" is a meta tag directive, not a robots.txt rule; robots.txt controls crawling via Disallow.)
Indexability is the ability of a web page to be added to a search engine's index after it has been crawled. If a page is indexable, it can appear in search results for relevant queries.
Example: Even if a page is crawlable, it might not be indexable if it contains a meta tag like <meta name="robots" content="noindex">, which tells search engines not to include the page in their index.
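To make the distinction concrete, here is a minimal sketch (using only Python's standard library) of how a crawler might detect a noindex directive in a page it has just fetched. The HTML string is a hypothetical example, not a real page:

```python
from html.parser import HTMLParser

class NoindexChecker(HTMLParser):
    """Scans HTML for a <meta name="robots"> tag whose content includes "noindex"."""
    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            a = dict(attrs)
            if a.get("name", "").lower() == "robots" and "noindex" in a.get("content", "").lower():
                self.noindex = True

# Hypothetical page source for illustration
html_doc = '<html><head><meta name="robots" content="noindex"></head><body>Hello</body></html>'
checker = NoindexChecker()
checker.feed(html_doc)
print(checker.noindex)  # True: the page is crawlable but asks not to be indexed
```

Note that the crawler must be able to fetch the page to see this tag at all, which is why noindex only works on crawlable pages.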
The robots.txt file is used to give instructions to search engine crawlers about which pages or sections of your site should not be crawled. It can prevent search engines from accessing sensitive or irrelevant parts of your site.
```
User-agent: *
Disallow: /private/
Disallow: /temp/
```

This tells all search engine crawlers not to crawl anything under the /private/ and /temp/ directories.
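You can check how these rules apply to a given URL with Python's built-in robots.txt parser. This sketch feeds it the same hypothetical rules, using example.com as a placeholder domain:

```python
from urllib import robotparser

# The same hypothetical robots.txt rules as above
rules = [
    "User-agent: *",
    "Disallow: /private/",
    "Disallow: /temp/",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

# Paths under a Disallow rule are not crawlable; everything else is
print(rp.can_fetch("*", "http://www.example.com/private/page.html"))  # False
print(rp.can_fetch("*", "http://www.example.com/blog/post.html"))     # True
```

In a real crawler you would call rp.set_url(...) and rp.read() to load the live robots.txt instead of parsing a hardcoded list.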
The sitemap.xml file lists the URLs of a website that are available for crawling. It helps search engines discover and prioritize pages for crawling, especially if your site has a complex structure or new pages that might not be easily found through regular crawling.
```xml
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2024-07-28</lastmod>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>http://www.example.com/about</loc>
    <lastmod>2024-07-28</lastmod>
    <priority>0.8</priority>
  </url>
</urlset>
```
This example shows a sitemap with two URLs, prioritizing the homepage.
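Sitemaps like this are usually generated rather than written by hand. Here is a minimal sketch, using Python's standard xml.etree.ElementTree module, that builds the same two-URL sitemap; the URLs, dates, and priorities are the illustrative values from the example above:

```python
import xml.etree.ElementTree as ET

def build_sitemap(entries):
    """Build a sitemap XML string from (loc, lastmod, priority) tuples."""
    urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for loc, lastmod, priority in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
        ET.SubElement(url, "priority").text = priority
    return ET.tostring(urlset, encoding="unicode")

sitemap = build_sitemap([
    ("http://www.example.com/", "2024-07-28", "1.0"),
    ("http://www.example.com/about", "2024-07-28", "0.8"),
])
print(sitemap)
```

A real site generator would pull the URL list from its routing table or database and write the result to sitemap.xml at the site root.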