How Do Software Spiders Work?


Google will sometimes reference a site without actually visiting the page: because several other sites link to that page, Google knows it exists. That, however, is a simplistic, 60,000-foot view of the action. The actual process that converts a Web page into an entry on a results page is a highly sophisticated data warehousing and information retrieval scheme, and it varies from engine to engine. In fact, this process of retrieving documents from a database is one key point of differentiation among search engines; the other lies in the services each engine offers and the partnerships it forms.
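
To make that link-based discovery concrete, here is a minimal sketch, not any engine's actual code, of how a crawler can learn that a URL exists simply because another page links to it, before the URL is ever fetched. The function names and the example.com address are placeholders.

from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        # Record every href found in an <a> tag, resolved to a full URL.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.add(urljoin(self.base_url, value))

def discover_links(url):
    # Fetch one page and return the URLs it links to. Each returned URL
    # is now "known to exist" by reference alone, even though it has not
    # been visited or indexed yet.
    html = urlopen(url).read().decode("utf-8", errors="replace")
    parser = LinkCollector(url)
    parser.feed(html)
    return parser.links

known_but_unvisited = discover_links("https://example.com/")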

Because of the sheer number and size of the documents indexed, each search engine has developed its own algorithms for deciding which pieces of data to store, along with compression methods that allow rapid searching and more economical storage of huge volumes of data.
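
The storage schemes themselves are proprietary, but the general idea can be sketched with a toy inverted index whose document-ID lists are gap (delta) encoded, a classic way to make postings lists smaller and faster to scan. This is an illustration of the technique, not a description of any particular engine's format.

from collections import defaultdict

def build_index(documents):
    # documents: dict of doc_id -> text. Returns word -> sorted list of doc IDs.
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

def gap_encode(doc_ids):
    # Store the difference between successive IDs; small gaps compress well.
    gaps, previous = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps

docs = {1: "web spiders crawl pages", 5: "spiders index the web", 9: "web search"}
index = build_index(docs)
print(index["web"])              # [1, 5, 9]
print(gap_encode(index["web"]))  # [1, 4, 4]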

By following each Web site’s navigation, a search engine’s spider can often read the entire text of each site it visits and index it into the engine’s main database. Many search engines, however, have begun to limit both how deeply they will crawl into a site and how many of its pages they will index; for any single site, the cap is often around 400 to 500 pages. For owners of very large sites, these limits can present a significant challenge. The reason for them appears to be that the Web, with its billions of documents, has simply grown too large to crawl and index in its entirety.
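
A depth- and page-limited crawl of this kind might look like the sketch below. It assumes a fetch_links(url) helper such as the discover_links function shown earlier; the limits of 3 levels and 500 pages are illustrative stand-ins for the caps described above, and the real policies differ from engine to engine.

from collections import deque

def crawl_site(start_url, fetch_links, max_depth=3, max_pages=500):
    visited = set()
    queue = deque([(start_url, 0)])            # (url, depth) pairs, breadth-first
    while queue and len(visited) < max_pages:  # stop once the page cap is reached
        url, depth = queue.popleft()
        if url in visited or depth > max_depth:
            continue                           # skip repeats and pages too deep
        visited.add(url)                       # "index" the page
        for link in fetch_links(url):
            if link not in visited:
                queue.append((link, depth + 1))
    return visited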

Even the largest search engines, no matter how widely and deeply they crawl, can expect to index only a portion of the Web. Their spiders have their work cut out for them just looking for new documents, let alone revisiting previously indexed documents to check for dead links and revised pages. Even so, search engines that populate their databases with spider technology can grow very large in relatively short periods of time. The number of pages a spider will index is unpredictable, and how long your pages will remain active once they have been indexed is equally unpredictable. You should therefore consider submitting a site map to ensure that the spider finds the major sections of your site, and include your most important keywords on the top-level page leading off your site map.
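
Beyond an on-site site map page, the major engines also accept an XML sitemap file that follows the public sitemaps.org protocol. The sketch below generates a minimal one; the URLs are placeholders to be replaced with the major sections of your own site.

SITEMAP_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
{entries}
</urlset>"""

def build_sitemap(urls):
    # One <url><loc>...</loc></url> entry per page you want the spider to find.
    entries = "\n".join(f"  <url><loc>{u}</loc></url>" for u in urls)
    return SITEMAP_TEMPLATE.format(entries=entries)

print(build_sitemap([
    "https://www.example.com/",
    "https://www.example.com/products/",
    "https://www.example.com/articles/",
]))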

What the spider sees on your site determines how your site is listed in the search engine’s index. Each search engine judges a page’s relevancy with a complex scoring system, or algorithm, which it tries to keep secret; these algorithms are the core proprietary technology of the search engines. Each system adds or subtracts points based on criteria such as how many times a keyword appears on the page, where on the page it appears, how many total words are found, and the ratio of keywords to content. The pages that earn the most points are returned at the top of the search results, and the rest are buried at the bottom, never to be found.
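
To illustrate the kind of criteria involved, here is a toy scoring function that weighs keyword count, keyword position, total word count, and keyword density. The weights are invented purely for illustration; the real algorithms are far more complex and, as noted, kept secret.

def score_page(text, keyword):
    words = text.lower().split()
    keyword = keyword.lower()
    total = len(words)
    if total == 0 or keyword not in words:
        return 0.0
    count = words.count(keyword)
    first_position = words.index(keyword)   # earlier appearances score higher
    density = count / total                 # ratio of keywords to content
    score = count * 10                      # how many times the keyword appears
    score += max(0, 50 - first_position)    # where on the page it appears
    score += density * 100                  # keyword-to-content ratio
    return score

pages = {"a": "spiders crawl the web and index pages", "b": "spiders spiders spiders"}
ranked = sorted(pages, key=lambda p: score_page(pages[p], "spiders"), reverse=True)
print(ranked)  # page IDs from highest to lowest score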


Source by Pamela Upshur