The Unicode Utilities are a set of programs for manipulating and analyzing Unicode text. uniname prints any combination of the character offset of each character, its byte offset, its hex code value, its encoding, the glyph itself, and its name. unidesc reports the character ranges to which different portions of the text belong. unihist generates a histogram of the characters in its input. ExplicateUTF8 determines and explains the validity of a sequence of bytes as a UTF-8 encoding. unirev reverses UTF-8 strings. unifuzz tests other programs' Unicode handling.
| Tags | Utilities, Text Processing, Linguistic |
|---|---|
| Implementation | C |
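
To make the idea of "explaining the validity" of a byte sequence concrete, here is a minimal Python sketch of a UTF-8 checker that reports the first problem it finds. It is not ExplicateUTF8's code or output format; the function name `explain_utf8` and the wording of the messages are assumptions made for illustration.

```python
def explain_utf8(data: bytes) -> str:
    """Return 'valid' or a short reason why data is not well-formed UTF-8.

    Hypothetical helper for illustration; not part of the Unicode Utilities.
    """
    i = 0
    n = len(data)
    while i < n:
        b = data[i]
        # The lead byte determines the sequence length and the smallest
        # code point that may legally use that length.
        if b < 0x80:
            length, cp, min_cp = 1, b, 0
        elif 0xC0 <= b < 0xE0:
            length, cp, min_cp = 2, b & 0x1F, 0x80
        elif 0xE0 <= b < 0xF0:
            length, cp, min_cp = 3, b & 0x0F, 0x800
        elif 0xF0 <= b < 0xF8:
            length, cp, min_cp = 4, b & 0x07, 0x10000
        elif 0x80 <= b < 0xC0:
            return f"byte {i}: unexpected continuation byte 0x{b:02X}"
        else:
            return f"byte {i}: invalid lead byte 0x{b:02X}"
        if i + length > n:
            return f"byte {i}: sequence truncated"
        # Every following byte must have the form 10xxxxxx.
        for j in range(i + 1, i + length):
            if data[j] & 0xC0 != 0x80:
                return f"byte {j}: expected continuation byte, got 0x{data[j]:02X}"
            cp = (cp << 6) | (data[j] & 0x3F)
        if cp < min_cp:
            return f"byte {i}: overlong encoding of U+{cp:04X}"
        if 0xD800 <= cp <= 0xDFFF:
            return f"byte {i}: surrogate U+{cp:04X} is not allowed"
        if cp > 0x10FFFF:
            return f"byte {i}: U+{cp:X} is beyond U+10FFFF"
        i += length
    return "valid"
```

For example, `explain_utf8(b"\xc0\xaf")` reports an overlong encoding of U+002F, while the UTF-8 encoding of any ordinary string yields `"valid"`.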
Recent releases


Changes: This release updates character data to Unicode version 5.1 and fixes a bug in the validation option of uniname as well as a couple of other minor bugs.


Changes: This release adds a new utility, unifuzz, which generates test input for programs expecting Unicode. In addition to generating random sequences of characters, unifuzz can generate a character from each range, tokens of various potentially problematic characters and sequences, very long lines, strings with embedded nulls, and ill-formed UTF-8.
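
The kinds of input listed above can be illustrated with a small Python sketch. This is not how unifuzz works internally and does not reproduce its options; the function names and the particular samples are assumptions chosen to show what "ill-formed UTF-8" and "problematic but well-formed" input might look like.

```python
def ill_formed_utf8_samples() -> dict:
    """Byte strings that a strict UTF-8 decoder should reject (illustrative set)."""
    return {
        "lone continuation byte": b"\x80",
        "truncated 3-byte sequence": b"\xe2\x82",
        "overlong encoding of '/'": b"\xc0\xaf",
        "encoded surrogate U+D800": b"\xed\xa0\x80",
        "code point above U+10FFFF": b"\xf4\x90\x80\x80",
        "invalid lead byte": b"\xff",
    }


def well_formed_stress_samples() -> dict:
    """Valid UTF-8 that still tends to expose bugs (illustrative set)."""
    return {
        "embedded null": "a\x00b".encode("utf-8"),
        "very long line": ("x" * 100_000 + "\n").encode("utf-8"),
    }


def is_well_formed(raw: bytes) -> bool:
    """Check a sample using Python's strict UTF-8 decoder."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

A program under test would be fed each sample in turn; a robust one should reject or safely handle every entry of the first set and accept every entry of the second.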


Changes: This release adds an option to unidesc that causes it to list the ranges detected after reading all input rather than listing them as they are encountered. uniname now has an option that causes it to ignore characters within the Basic Multilingual Plane.
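
The difference between listing ranges as they are encountered and listing them after all input has been read can be sketched in Python as follows. The block table is a tiny hand-picked subset, and the function names and output shape are assumptions; unidesc's actual output format is not shown here. The last function illustrates the idea of skipping characters within the Basic Multilingual Plane.

```python
# A small, hand-picked subset of Unicode blocks, for illustration only.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x0400, 0x04FF, "Cyrillic"),
]


def block_of(ch: str) -> str:
    """Look up the block name of a single character in the sample table."""
    cp = ord(ch)
    for lo, hi, name in BLOCKS:
        if lo <= cp <= hi:
            return name
    return "(not in sample table)"


def ranges_as_encountered(text: str) -> list:
    """List a block each time the text moves into a different one."""
    out = []
    for ch in text:
        name = block_of(ch)
        if not out or out[-1] != name:
            out.append(name)
    return out


def ranges_after_all_input(text: str) -> list:
    """List each block once, in order of first appearance."""
    return list(dict.fromkeys(block_of(ch) for ch in text))


def outside_bmp(text: str) -> list:
    """Keep only characters above U+FFFF, i.e. outside the BMP."""
    return [ch for ch in text if ord(ch) > 0xFFFF]
```

With the input `"abγδcd"`, the first function lists Basic Latin twice (before and after the Greek letters), while the second lists each block only once.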


Changes: This release adds the utility unirev, a filter that reverses UTF-8 strings character-by-character. The package name has changed.
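
Reversing UTF-8 character by character, rather than byte by byte, means keeping the bytes of each encoded character in their original order. A minimal Python sketch of that idea follows; it is not unirev's code, the function name `reverse_utf8` is an assumption, and the input is assumed to be well-formed UTF-8.

```python
def reverse_utf8(data: bytes) -> bytes:
    """Reverse well-formed UTF-8 one encoded character at a time.

    Hypothetical helper for illustration; not part of the Unicode Utilities.
    """
    chars = []
    start = 0
    for i in range(1, len(data) + 1):
        # A character ends where the input ends or where the next byte
        # is not a continuation byte (10xxxxxx).
        if i == len(data) or data[i] & 0xC0 != 0x80:
            chars.append(data[start:i])
            start = i
    return b"".join(reversed(chars))
```

Note that this sketch reverses code points, so a base letter followed by a combining mark would end up with the mark in front of the letter. Whether unirev treats such sequences specially is not stated in the release notes.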


Changes: uniname and unidesc now report the unofficial ranges within the Private Use Areas registered with the ConScript Unicode Registry.