cpdetector is a small yet clever framework for codepage detection that integrates different strategies. It may be used as a library by third-party software that accesses textual data over a network. It also includes a best-practice implementation in the form of a command-line tool that sorts and transforms large collections of documents based on their codepage. Available strategies include jchardet (exclusion, frequency analysis, and guessing), detection of the HTML charset property, and detection of the XML encoding declaration.
| Tags | Communications, Information Management, Internet, Web, Indexing/Search, Software Development, Internationalization, Libraries, Java Libraries |
|---|---|
| Licenses | MPL |
| Implementation | Java |
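
The two declaration-based strategies named in the description (the HTML charset property and the XML encoding declaration) can be illustrated with a minimal, self-contained sketch. This is not cpdetector's API: the class `DeclaredCharsetSniffer` and its methods are hypothetical names chosen for illustration, the regular expressions are a simplification, and the statistical jchardet step is not shown. The sketch assumes the document starts in an ASCII-compatible encoding, so its first bytes can be read as ISO-8859-1 and searched for a declaration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of cpdetector.
final class DeclaredCharsetSniffer {

    // Matches e.g. <?xml version="1.0" encoding="ISO-8859-1"?> at the start of the data.
    private static final Pattern XML_DECL = Pattern.compile(
            "\\A\\s*<\\?xml[^>]*?encoding\\s*=\\s*[\"']([A-Za-z0-9._:-]+)[\"']");

    // Matches e.g. <meta charset="utf-8"> or
    // <meta http-equiv="Content-Type" content="text/html; charset=utf-8">.
    private static final Pattern HTML_META = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
            Pattern.CASE_INSENSITIVE);

    static Optional<Charset> fromXmlDeclaration(String head) {
        return firstGroup(XML_DECL, head);
    }

    static Optional<Charset> fromHtmlMeta(String head) {
        return firstGroup(HTML_META, head);
    }

    // Returns the charset named by the first capture group, if the JVM knows it.
    private static Optional<Charset> firstGroup(Pattern pattern, String head) {
        Matcher m = pattern.matcher(head);
        if (!m.find()) {
            return Optional.empty();
        }
        try {
            return Optional.of(Charset.forName(m.group(1)));
        } catch (IllegalArgumentException e) {
            // Illegal or unsupported charset name: treat as "no result".
            return Optional.empty();
        }
    }

    // Tries each strategy in order; the first one that yields a charset wins.
    static Optional<Charset> detect(byte[] data) {
        int len = Math.min(data.length, 1024); // only inspect the head of the document
        String head = new String(data, 0, len, StandardCharsets.ISO_8859_1);
        List<Function<String, Optional<Charset>>> strategies = List.of(
                DeclaredCharsetSniffer::fromXmlDeclaration,
                DeclaredCharsetSniffer::fromHtmlMeta);
        for (Function<String, Optional<Charset>> strategy : strategies) {
            Optional<Charset> result = strategy.apply(head);
            if (result.isPresent()) {
                return result;
            }
        }
        return Optional.empty();
    }
}
```

An empty result means that no declaration was found or that the declared name is unknown to the JVM. In a fuller chain, that is the point where a statistical detector would be consulted.
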
Recent releases


Changes: The release structure has been changed: cpdetector.jar no longer contains third-party library files. Public functions that were missing have been restored. The ProGuard shrinker has been updated from version 3.8 to 4.2.


Changes: The ProGuard shrinker is now used, so the cpdetector jar is now more than ten times smaller. System.out is no longer used for logging in JChardetFacade. All packages were renamed with the prefix "info.monitorenter".


Changes: Severe errors, such as a potential infinite loop and incorrect file handling, have been fixed.


Changes: A bug in the Ant build of the source release has been fixed. Instructions for document tests with FIT were added.


Changes: It is now possible to let cpdetector guess the codepage from the remaining candidates when the set cannot be narrowed down to one. This version marks the start of testing with FIT. A new best-practice command-line tool prints the codepage name for file arguments.