HarvestMan is a multithreaded offline browser. It offers many options for customizing offline browsing, including URL filters, word filters, domain filters, URL priorities, depth fetching, fetch levels, file limits, time limits, and the robot exclusion protocol, among others. It is useful for downloading an entire Web site, or selected files from one, to the hard disk for later offline browsing. It supports the HTTP, HTTPS, and FTP protocols and can work through proxies.
| Tags | Internet, Web, Browsers, Indexing/Search, Software Development, Libraries, Python Modules, Utilities |
|---|---|
| Licenses | GPL |
| Operating Systems | OS Independent |
| Implementation | Python |
Recent releases


Changes: The install scripts were fixed. They had problems working with Python 2.4.


Changes: This release fixes a bug in the regular expression for localizing URLs, a bug in resuming a project by reading back its project file, and errors with a few command-line options that were not working correctly. It also adds a subdomain flag to the command line.


Changes: This release adds new, user-friendly command-line options; a new nocrawl command-line flag for only downloading URLs, similar to wget; support for the .chm, .cfm, .cfml, .php4, and .aspx Web page extensions; and a duplicate link bugfix for the URL tree printing option. Other minor bugfixes were made and readme.txt was updated.


Changes: This release replaces lists at critical places with the new collections.deque data structure, which improves performance when run with Python 2.4. A bug in handling HTTP redirects that require cookies has been fixed. Many bugs that caused invalid URL (HTTP 404) errors have been fixed. The modules htmlparser and cookiemgr have been removed, since they are no longer used. The default locale has been changed to 'C'. Bugs in the logger.py, connector.py, and config.py modules have been fixed.
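
As a rough illustration of why swapping a list for collections.deque can help a crawler, here is a minimal sketch of a breadth-first URL queue. It is not HarvestMan's code: the `crawl_order` function and the `site` link map are hypothetical, and the example targets current Python 3 rather than Python 2.4.

```python
from collections import deque


def crawl_order(start, links):
    """Return URLs in breadth-first order, starting from `start`.

    `links` maps each URL to the URLs it references (hypothetical data
    standing in for links parsed out of downloaded pages).
    """
    queue = deque([start])
    seen = {start}
    order = []
    while queue:
        # popleft() is O(1) on a deque; list.pop(0) shifts every
        # remaining element and is O(n).
        url = queue.popleft()
        order.append(url)
        for child in links.get(url, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return order


# Hypothetical three-level site used only for illustration.
site = {
    "/": ["/a", "/b"],
    "/a": ["/c"],
    "/b": ["/c"],
}
print(crawl_order("/", site))
```

The gain comes from removing items at the front of the queue, which is where a first-in, first-out URL queue spends much of its time.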


Changes: The config file format has been changed from text to XML. There is a new HTML parser based on the SGMLParser module, and the dependency on HTML tidy has been removed. A new archive feature archives project files to tar.bz2/tar.gz archives. Project caching has changed: Web page data is compressed before being written to the cache, there is an option for writing the cache in DBM format, and URL headers are also written to the cache. A junk filter has been added for filtering out banner ads and similar URLs. This release works with Python 2.4.
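
The caching change above (compress page data, optionally store it in DBM format) can be sketched roughly as follows. This is an assumption-laden illustration, not HarvestMan's cache layout: the `cache_put`, `cache_get`, and `roundtrip` helpers are hypothetical, zlib is assumed as the compressor, and the standard library's `dbm.dumb` backend stands in for whichever DBM implementation the project actually uses.

```python
import dbm.dumb
import os
import tempfile
import zlib


def cache_put(db, url, body):
    """Compress the page body and store it under the URL (hypothetical layout)."""
    db[url.encode("utf-8")] = zlib.compress(body)


def cache_get(db, url):
    """Fetch and decompress the page body stored for the URL."""
    return zlib.decompress(db[url.encode("utf-8")])


def roundtrip(url, body):
    """Write one page to a throwaway DBM cache and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        db = dbm.dumb.open(os.path.join(tmp, "cache"), "c")
        try:
            cache_put(db, url, body)
            return cache_get(db, url)
        finally:
            db.close()


page = b"<html><body>hello</body></html>" * 50
assert roundtrip("http://example.com/", page) == page
```

Compressing before writing trades a little CPU time for a smaller cache on disk, which tends to pay off for repetitive HTML.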