crawl

Crawl starts a depth-first traversal of the Web at the specified URLs. It stores all JPEG images that match the configured constraints. Crawl is fairly fast and can be terminated gracefully. A terminated crawl can be restarted at exactly the spot where it stopped. It also keeps a persistent database, so multiple crawls do not revisit the same sites.
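
The traversal and restart behaviour described above can be illustrated with a minimal sketch. It is not crawl's actual implementation: the SQLite schema, the function names open_state and crawl, the max_pages cutoff (standing in for graceful termination), and the get_links callback (standing in for fetching and parsing a page) are all assumptions made for illustration.

```python
import sqlite3


def open_state(path=":memory:"):
    """Open (or create) the persistent crawl state. Schema is hypothetical."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS visited (url TEXT PRIMARY KEY)")
    db.execute(
        "CREATE TABLE IF NOT EXISTS frontier (pos INTEGER PRIMARY KEY, url TEXT)"
    )
    return db


def crawl(db, start_urls, get_links, max_pages=None):
    """Depth-first traversal; returns the URLs visited during this run.

    get_links(url) is a caller-supplied function returning the outgoing
    links of a page. max_pages simulates terminating the crawl early.
    """
    # Resume from a saved frontier if one exists, otherwise seed the stack.
    stack = [row[0] for row in db.execute("SELECT url FROM frontier ORDER BY pos")]
    if not stack:
        stack = list(reversed(start_urls))

    visited_now = []
    while stack and (max_pages is None or len(visited_now) < max_pages):
        url = stack.pop()
        already = db.execute(
            "SELECT 1 FROM visited WHERE url = ?", (url,)
        ).fetchone()
        if already:
            continue  # seen in this or an earlier crawl
        db.execute("INSERT INTO visited (url) VALUES (?)", (url,))
        visited_now.append(url)
        # Push links in reverse so the first link is explored first.
        stack.extend(reversed(get_links(url)))

    # Save the remaining stack so a later run continues at the same spot.
    db.execute("DELETE FROM frontier")
    db.executemany(
        "INSERT INTO frontier (pos, url) VALUES (?, ?)", list(enumerate(stack))
    )
    db.commit()
    return visited_now
```

Calling crawl a second time on the same database picks up the saved stack, and a later crawl from the same start URLs skips everything already recorded in the visited table.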

Tags: Internet Web Indexing/Search
Licenses: BSD Original
Operating Systems: POSIX


Recent releases

  •  17 May 2003 19:42

Changes: Crawling is more reliable, and crawl performance has improved for large crawls.

  •  28 Jan 2002 22:47

Changes: A complete rewrite for higher performance, including support for all media types, asynchronous DNS lookups, and an optional wait time between hosts. The configuration file specifies, among other things, the permitted size of media objects for each media type.
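
As a rough illustration of per-media-type size limits and a wait time, the sketch below uses a plain Python dictionary. It does not reproduce crawl's real configuration file syntax: the names SIZE_LIMITS, is_permitted, and seconds_until_allowed and all numeric values are hypothetical. It also reads "wait time between hosts" as a minimum delay between successive requests to the same host, which is an assumption.

```python
# Hypothetical limits: media type -> (min_bytes, max_bytes).
# Values and layout are illustrative, not crawl's configuration format.
SIZE_LIMITS = {
    "image/jpeg": (2_000, 5_000_000),
    "audio/mpeg": (100_000, 20_000_000),
}


def is_permitted(media_type, size_bytes, limits=SIZE_LIMITS):
    """Return True if an object of this type and size should be stored."""
    bounds = limits.get(media_type)
    if bounds is None:
        return False  # media type not configured: skip it
    low, high = bounds
    return low <= size_bytes <= high


def seconds_until_allowed(host, now, last_fetch, wait=5.0):
    """Return how long to wait before contacting host again.

    last_fetch maps host name -> time of the previous request (seconds).
    """
    last = last_fetch.get(host)
    if last is None:
        return 0.0  # never contacted: no delay needed
    return max(0.0, wait - (now - last))
```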

  •  12 Dec 2001 21:36

Changes: This release has portability fixes and supports downloading media types other than images.

  •  03 Jul 2001 16:18

Changes: This release fixes a bug where crawl would stop early when encountering errors on subsequent connections. The verbosity level and the number of concurrent connections are now tunable.

  •  18 Jun 2001 16:09

No changes have been submitted for this release.
