Web crawler
A Web crawler is a computer program that browses the World Wide Web in a methodical, automated manner.
- Yahoo! Slurp is the name of the Yahoo Search crawler.
- Bingbot is the name of Microsoft's Bing web crawler. It replaced Msnbot.
- FAST Crawler is a distributed crawler used by Fast Search & Transfer; a general description of its architecture is available.
- Googlebot is described in some detail, but the reference only covers an early version of its architecture, which was written in C++ and Python.
The crawler was integrated with the indexing process, because text parsing was done both for full-text indexing and for URL extraction. A URL server sent lists of URLs to be fetched by several crawling processes. During parsing, the URLs found were passed back to the URL server, which checked whether each URL had been seen before; if not, it was added to the URL server's queue, as sketched below.
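A minimal sketch of that frontier/seen-URL pattern, written in Python (one of the languages the early Googlebot reportedly used). The function and variable names here (crawl, frontier, seen) are illustrative assumptions, not Googlebot's actual components, and the link extraction is deliberately simplistic.

```python
# Sketch of a breadth-first crawler: a queue plays the role of the URL
# server's frontier, and a set records URLs that have already been seen.
# Names are illustrative; this is not Googlebot's actual implementation.
import re
import urllib.parse
import urllib.request
from collections import deque

LINK_RE = re.compile(r'href=["\'](.*?)["\']', re.IGNORECASE)

def crawl(seed_urls, max_pages=50):
    frontier = deque(seed_urls)   # URLs waiting to be fetched
    seen = set(seed_urls)         # "previously seen" check
    while frontier and max_pages > 0:
        url = frontier.popleft()
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue  # skip pages that fail to fetch
        max_pages -= 1
        # Parsing serves both indexing (page text) and URL extraction;
        # here we only extract links.
        for link in LINK_RE.findall(html):
            absolute = urllib.parse.urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)         # mark as seen
                frontier.append(absolute)  # enqueue for a later fetch
    return seen

if __name__ == "__main__":
    discovered = crawl(["https://example.com/"])
    print(f"Discovered {len(discovered)} URLs")
```

In a distributed design like the one described, the seen-set and queue live in a separate URL server process rather than in the crawler itself, so many crawling processes can share one frontier.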
Open-source crawlers
- Aspseek is a crawler, indexer, and search engine written in C++ and licensed under the GPL.
- DataparkSearch is a crawler and search engine released under the GNU General Public License.