Most websites have a "root" or "home page," which is typically named index.html and is the default page a user sees when visiting the site. Typing http://www.dlib.org into the browser address bar is in fact a request for the site's "root" document, which in our example is http://www.dlib.org/index.html, labeled "web root" in Figure 1. From that point on, site discovery depends on following the links on that root page to other pages on the site. Note that in our Figure 1 example, the site has specifically excluded certain pages from crawling by listing them in a "robots.txt" file, following the Robots Exclusion Protocol. Polite robots like the Big Three respect such requests, even though those pages remain visible on the web to regular users. Other pages have access restrictions (shown in red) that prevent any unauthorized access. One common way to impose such restrictions is an ".htaccess" file, an Apache web server configuration file that can require a user ID and password before the resources it protects can be viewed. Some pages, such as a credit card charge page, might only be generated on the fly, that is, through user interaction with the website. Typically, crawlers do not activate such links and therefore do not discover those pages.
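To make the robots.txt exclusion concrete, the following is a minimal sketch of how a polite crawler might check a URL before requesting it, using Python's standard urllib.robotparser module. The robots.txt content, the www.example.org URLs, and the "ExampleBot" user-agent name are assumptions made for illustration; they are not the actual rules of the site shown in Figure 1.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real site's rules will differ.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)

# "ExampleBot" is a made-up crawler name used only for this sketch.
for url in (
    "http://www.example.org/index.html",
    "http://www.example.org/private/report.html",
):
    status = "allowed" if is_allowed(parser, "ExampleBot", url) else "excluded"
    print(url, status)
```

Note that the check is voluntary: the excluded page is still served to any browser or impolite robot that asks for it, which is why access restrictions such as passwords are a separate mechanism.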
Web crawlers also generally start their crawls at the site's web root, in part because this is where a site's main links can usually be found. The crawler finds the accessible pages of a site by extracting the links from each page that it visits, adding these to its list of pages for the site, and visiting each of those in turn. Once all unique links for a site have been collected and visited, the crawler is done. Of course, a page can exist without any other page on the site linking to it. In that case, a crawler will not be able to find out about the page unless it is linked from some other site or is included in a sitemap submitted to the search engine. Occasionally, links are "bad" – they point to a page that either no longer exists or has been renamed. Users may "guess" at the new location, or they might use the website's search feature to find it. For a crawler, a bad link leads nowhere. The basic rule holds: crawlers can only visit pages that they know about and that actually exist.
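The link-following process described above can be sketched as a breadth-first traversal. The example below is an illustration under stated assumptions, not a description of any search engine's actual crawler: the SITE dictionary is a hypothetical in-memory stand-in for real HTTP requests, and the www.example.org pages, including the bad link gone.html and the unlinked orphan.html, are invented for the purpose.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical site: URL -> HTML. A missing key plays the role of a bad link.
SITE = {
    "http://www.example.org/index.html":
        '<a href="back.html">Back Issues</a> <a href="forum.html">Forum</a>',
    "http://www.example.org/back.html":
        '<a href="index.html">Home</a> <a href="gone.html">Old article</a>',
    "http://www.example.org/forum.html": '<a href="index.html">Home</a>',
    # Exists, but no page links to it, so the crawl never learns about it.
    "http://www.example.org/orphan.html": "No page links here.",
}


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute URLs for all links found in the page."""
    extractor = LinkExtractor()
    extractor.feed(html)
    return [urljoin(base_url, href) for href in extractor.links]


def crawl(root, site):
    """Visit every page reachable from root; report pages and bad links."""
    seen = {root}          # every URL discovered so far
    queue = deque([root])  # URLs waiting to be visited
    visited, bad = [], []
    while queue:
        url = queue.popleft()
        html = site.get(url)
        if html is None:   # the link points at a page that does not exist
            bad.append(url)
            continue
        visited.append(url)
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited, bad


visited, bad_links = crawl("http://www.example.org/index.html", SITE)
print("visited:", visited)
print("bad links:", bad_links)
```

In this sketch the crawl ends when the queue is empty, which corresponds to all unique links having been collected and visited. The page orphan.html never appears in the results, and gone.html is recorded as a bad link rather than a visited page.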
Many websites use the root page as an entry point to other sections of the site. D-Lib Magazine's root page contains links to Back Issues, to the D-Lib Forum, and to many other areas of the website, in addition to links to the current month's articles. Search engine crawlers may initially visit just the root page to collect the list of main links and then pass each of these main links to a series of crawlers. The overall demand that many simultaneous robot requests place on the web server can degrade its performance, but most search engine crawlers have worked out a reasonable compromise between efficient crawling and server responsiveness.
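One way to picture the compromise between crawl speed and server load is a small pool of workers that share a single rate limit. The sketch below is only an assumption about how such a scheme could look; search engines do not publish their schedulers in this form. The names Throttle, dispatch, and fake_fetch, the worker count, the delay value, and the example URLs are all hypothetical.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class Throttle:
    """Space out request start times across all workers."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self):
        # Reserve the next free time slot under the lock, then sleep outside
        # it so other workers can reserve their own slots in the meantime.
        with self._lock:
            now = time.monotonic()
            delay = max(0.0, self._next_allowed - now)
            self._next_allowed = max(now, self._next_allowed) + self.min_interval
        if delay > 0:
            time.sleep(delay)


def crawl_section(url, throttle, fetch):
    """Fetch one main link after waiting for a polite time slot."""
    throttle.wait()
    return url, fetch(url)


def dispatch(main_links, fetch, max_workers=3, min_interval=0.05):
    """Hand each main link to a worker; at most max_workers run at once."""
    throttle = Throttle(min_interval)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(crawl_section, url, throttle, fetch)
            for url in main_links
        ]
        return dict(future.result() for future in futures)


def fake_fetch(url):
    """Stand-in for a real HTTP request; returns a placeholder body."""
    return "contents of " + url


main_links = [
    "http://www.example.org/back.html",
    "http://www.example.org/forum.html",
    "http://www.example.org/current.html",
]
results = dispatch(main_links, fake_fetch)
print(sorted(results))
```

Raising max_workers or lowering min_interval makes the crawl finish sooner at the cost of more load on the server; the values shown are arbitrary and chosen only to keep the example quick to run.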