Combine

Combine is an open and extensible system for crawling Internet resources, including harvesting and indexing. It can be used as either a general or a focused crawler. Integration with database systems is provided to make it possible to build a complete vertical search engine.

Tags: Database, Internet, Web, Indexing/Search, Z39.50
Operating Systems: Unix
Implementation: Perl

Recent releases

  • 16 Jun 2009 08:22

    Changes: Better handling of special characters, better HTML-to-text extraction, support for new URL scheduling algorithms (including score-based algorithms), and support for exceptions to GeoIP. Some tests were fixed.

  • 09 Dec 2008 23:32

    Changes: This release is integrated with the Solr enterprise search server, and can feed records directly to a Solr server. There is also a new version numbering system that is compatible with CPAN requirements.

  • 18 Nov 2008 21:33

    Changes: Code for simple Lucene integration has been added to the templates directory. The documentation HTML generator has been changed to use tex4ht.

  • 13 Nov 2008 19:07

    Changes: This release adds the switch ZebraIndexing to combineExport. It enables updating of the configured Zebra server with exported records. It fixes a bug in Zebra recordId handling. It adds the switches 'collapseinlinks' and 'nooutlinks' to combineExport. It improves indexing of PDF documents. It fixes a bug in the processing of pure text documents.

  • 15 Oct 2008 09:52

    Changes: A full-text index was added to the MySQL table search, along with a configuration variable to enable or disable it. Integration with the Zebra database system was fixed. Updates, fixes, and code cleaning were done. Support for SVM classifiers was added (which depends on SVMLight). Country determination was added (adding a dependency on GeoIP). Two new PlugIn types were added: "relevant text extraction" and "extra analysis".
