Wednesday, August 15, 2012

Web crawler

A Web crawler is a computer program that browses the World Wide Web in a methodical, automated manner.

  • Yahoo! Slurp is the name of the Yahoo Search crawler.
  • Bingbot is the name of Microsoft's Bing webcrawler. It replaced Msnbot.
  • FAST Crawler is a distributed crawler used by Fast Search & Transfer; a general description of its architecture is available.
  • Googlebot is described in some detail, but the reference covers only an early version of its architecture, which was written in C++ and Python. The crawler was integrated with the indexing process, because text parsing was done both for full-text indexing and for URL extraction. A URL server sent lists of URLs to be fetched by several crawling processes. During parsing, the URLs found were passed back to the URL server, which checked whether each URL had been seen before; if not, it was added to the URL server's queue. A minimal sketch of this fetch/parse/deduplicate loop follows the list.
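The early Googlebot description amounts to a simple loop: a URL server hands out URLs, fetchers download pages, a parser extracts links, and a seen-set decides whether a newly discovered URL joins the queue. Below is a minimal, single-process sketch of that pattern in Python using only the standard library; the names (URLServer, LinkExtractor), the seed URL, and the page limit are illustrative assumptions, not part of Googlebot's actual implementation.

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags while parsing a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


class URLServer:
    """Hands out URLs to fetch and rejects URLs it has already seen."""

    def __init__(self, seeds):
        self.seen = set(seeds)
        self.queue = deque(seeds)

    def next_url(self):
        return self.queue.popleft() if self.queue else None

    def submit(self, url):
        # Only previously unseen URLs are added to the queue.
        if url not in self.seen:
            self.seen.add(url)
            self.queue.append(url)


def crawl(seeds, max_pages=10):
    server = URLServer(seeds)
    fetched = 0
    while fetched < max_pages:
        url = server.next_url()
        if url is None:
            break
        try:
            with urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except OSError:
            continue  # skip pages that fail to fetch
        fetched += 1
        parser = LinkExtractor()
        parser.feed(html)
        # Parsing yields outgoing URLs, which are handed back
        # to the URL server for deduplication and queueing.
        for link in parser.links:
            server.submit(urljoin(url, link))
    return server.seen


if __name__ == "__main__":
    print(crawl(["https://example.com/"], max_pages=5))

A production crawler would add politeness delays, robots.txt handling, and multiple parallel crawling processes; those are omitted here to keep the URL-server pattern visible.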

Open-source crawlers

  • Aspseek is a crawler, indexer and search engine written in C++ and licensed under the GPL.
  • DataparkSearch is a crawler and search engine released under the GNU General Public License.
