webbase is an Internet crawler. It can crawl and maintain millions of URLs and store information about them in a MySQL database. The interface is either a command-line program or a C library. It contains hooks for plugging in a full-text indexing database.
| Tags | Software Development |
|---|---|
| Licenses | GPL |
Recent releases


Changes: The exploration will now stop only after or before loading a URL. The -touch option forces loading even if the content of the URL is already in the database. An index updating bug that removed documents from the index when the crawler found them Not Modified was fixed. The md5 code was upgraded to GPL, and other small utility fixes were made.


Changes: The -version option now shows the version number. An allocation error when updating the full-text index was fixed, and the handling of name server timeout conditions was optimized. /etc/my.cnf, ~/.my.cnf, and datadir/my.cnf are now used instead of ~/.my.cnf alone.


Changes: Dynamic updating of the full-text index was implemented. Fixes were made for a last-modified time update bug, a mysql-3.23.19a-gamma namespace conflict, and a bug that artificially left the start point in a virgin state.


Changes: A -crawlers option to run simultaneous crawlers and a signal handling function for graceful interruption of the crawlers were added, and the url, url_complete, and url_content tables can now grow beyond 4GB. The hook library is dynamically loadable with the -hook option so that specific full indexing strategies can be implemented as plugins. The -where_url option is taken into account when rebuilding the full-text index with -rebuild. Extensions and MIME types have been added to the list of known MIME types. The auth field of the start table was removed because it was not used.


Changes: The crawler manual page was completely reviewed for correctness. Bug fixes were made in the mifluz interface. The -agent option was implemented. The -show option family was added to display all URL information from an exploration starting point. The configuration script was improved. Major leaks and concurrency problems were fixed in the langrec interface. The scope of the allow/disallow comparison was widened to include CGI parameters. Code to use .my.cnf files (if any) was restored.