uni2ascii

uni2ascii and ascii2uni convert in both directions between UTF-8 Unicode and more than thirty 7-bit ASCII equivalents, including RFC 2396 URI format, RFC 2045 Quoted Printable format, and the representations used in HTML, SGML, XML, OOXML, the Unicode standard, Rich Text Format, POSIX portable charmaps, POSIX locale specifications, and Apache log files. They can also convert to and from the escapes used for Unicode in languages such as Ada, C, Common Lisp, Java, Pascal, Perl, PostScript, Python, Scheme, and Tcl.

Tags: Text Processing Markup General Software Development HTML/XHTML SGML Internationalization Linguistic
Licenses: GPL
Operating Systems: POSIX
Implementation: C, Tcl


Recent releases

  • 22 Apr 2009 10:46

Changes: This release fixes a bug in which uni2ascii gave an incorrect report of the number of characters converted to ASCII.

  • 26 Mar 2009 13:45

    Changes: Both programs now permit the input file name to be specified on the command line without redirection.

    • 03 Oct 2008 03:58

    Changes: This release adds support for the <XX><XX> and %uXXXX formats.

    • 31 Aug 2008 07:09

    Changes: This release fixes a bug that made the Y argument to the -a flag of ascii2uni a no-op, and corrects the man pages and help for the Y and Q arguments to the -a flag for both programs. The Y argument is now an error for uni2ascii. The version information and action summaries are more informative.

    • 07 May 2008 03:13

    Changes: This release fixes a bug that produced bad output or a segmentation fault if a line ended in the prefix to an escape. In quoted-printable format, if a line ends in an equals sign, both the equals sign and the immediately following newline are now skipped by ascii2uni, in accordance with RFC 2045.
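The soft line break rule from RFC 2045 mentioned in this release note can be illustrated with a minimal sketch. It uses Python's standard `quopri` module rather than ascii2uni itself, and the sample input is made up for illustration.

```python
import quopri

# "=C3=A9" is the quoted-printable form of the two UTF-8 bytes of U+00E9.
# The "=" at the end of the first line is a soft line break: a decoder
# drops both the "=" and the newline that follows it.
encoded = b"caf=C3=A9 au =\nlait\n"

decoded = quopri.decodestring(encoded)
text = decoded.decode("utf-8")
print(text)
```

The decoded text is a single line, because the soft line break joins the two input lines.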

    Recent comments

    13 Jan 2006 09:42 billposer

    Re: Recode

    Recode and uni2ascii are complementary. Briefly put, Recode converts from one encoding to another (where the expectation is that the target character set will be the same as, or a superset of, the source character set), whereas Uni2ascii converts between UTF-8 Unicode and ASCII representations of Unicode. In practical terms, Uni2ascii will not convert between, say, ASCII and EBCDIC, which Recode will, whereas Recode will not convert between Unicode and the \x{00E9} format, which Uni2ascii will. (I should say that Recode lists but does not explain the encodings that it knows, so it is not always easy to figure out what it handles. It is possible that it can handle things that I am not aware of. But at least as far as I can tell, it does not handle the textual representations of Unicode characters that Uni2ascii handles.)

    Thus, if you've got a text in, say, TIS-620 (the Thai national standard) and you want to get it into Unicode, you would use Recode. If you want to include that Thai text in a blog posting using Movable Type, which is not 8-bit safe, you would use Uni2ascii to convert your Unicode version of the Thai text to HTML numeric character references. Similarly, if you wanted to include that Thai text as a string in a program in Java, Python, Scheme, or Tcl, you would use uni2ascii to convert the Unicode to the \uxxxx format.
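As a rough illustration of the two target formats named above, here is a minimal Python sketch. It is not how uni2ascii is implemented; the function names `to_html_ncr` and `to_u_escape` are hypothetical, and the `\uXXXX` helper assumes every character lies in the Basic Multilingual Plane, which holds for Thai.

```python
def to_html_ncr(text: str) -> str:
    """Replace each non-ASCII character with a hexadecimal HTML
    numeric character reference such as &#x0E44;."""
    return "".join(
        ch if ord(ch) < 0x80 else f"&#x{ord(ch):04X};" for ch in text
    )


def to_u_escape(text: str) -> str:
    """Replace each non-ASCII character with a \\uXXXX escape.

    Assumption: code points above U+FFFF are rejected, since their
    representation differs between languages.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)
        elif cp <= 0xFFFF:
            out.append(f"\\u{cp:04X}")
        else:
            raise ValueError(f"U+{cp:06X} is outside the BMP")
    return "".join(out)


# The word "Thai" in Thai script: U+0E44 U+0E17 U+0E22.
thai = "\u0e44\u0e17\u0e22"
print(to_html_ncr(thai))
print(to_u_escape(thai))
```

Both results contain only 7-bit ASCII characters, so they survive a pipeline that is not 8-bit safe.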

    My conception of the difference is this. When you have the same character set but different associations between the characters and the integers, conversion between the two is pure encoding conversion. ASCII and EBCDIC are different encodings of the same character set; converting between them is a matter of encoding conversion.
    On the other hand, when you have radically different character sets, conversion from one to the other is a matter of transliteration. Transliteration may be perfect, or nearly so, if both writing systems have been adapted for the same language (e.g. in the case of the Roman and Cyrillic writing systems for Serbo-Croatian) or quite imperfect (e.g. when Vietnamese is written using only the English alphabet).

    A third situation is when you use escape sequences to represent the characters of one character set in another.
    That's what we're doing when we use the sequence of ASCII characters \x{00E9} to represent the Unicode character U+00E9 "Latin small letter e with acute".
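Going the other way, from the escape sequence back to the character, can be sketched in a few lines of Python. This only illustrates the idea and is not ascii2uni's actual code; the name `decode_x_braces` is hypothetical, and the pattern assumes one to six hexadecimal digits between the braces.

```python
import re

# Matches escapes of the form \x{00E9}: a backslash, "x", and one to
# six hexadecimal digits in braces (an assumed limit for this sketch).
ESCAPE = re.compile(r"\\x\{([0-9A-Fa-f]{1,6})\}")


def decode_x_braces(ascii_text: str) -> str:
    """Replace each \\x{XXXX} escape with the character it names."""
    return ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), ascii_text)


print(decode_x_braces(r"caf\x{00E9}"))
```

Text that contains no escapes passes through unchanged, so plain ASCII input is left as it is.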

    Recode is basically intended to handle encoding conversion. Uni2ascii, on the other hand, is aimed at the third case, the representation of Unicode characters by ASCII escape sequences. Other programs (e.g. my own Xlit) deal with transliteration.

    Of course, the division I've made here, while, I think, the one that people usually make, is not quite so simple, since what are generally thought of as different encodings of the same character set may in fact use somewhat different character sets. For example, decomposed Unicode uses sequences of two or more Unicode characters to represent what in other encodings are single characters. Thus e with acute accent is a single character in ISO-8859-1 (0xE9) but is a two-character sequence (0x0065 0x0301) in non-composed Unicode, where it is treated as plain e followed by acute accent. Encoding conversion programs like Recode are therefore, in the strict sense, doing more than pure encoding conversion.
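The composed and decomposed forms of e with acute accent can be inspected directly. The following sketch uses Python's standard `unicodedata` module and makes no assumptions about how Recode or Uni2ascii handle normalization.

```python
import unicodedata

composed = "\u00e9"  # single code point: e with acute accent

# Canonical decomposition (NFD) splits it into a base letter
# followed by a combining acute accent.
decomposed = unicodedata.normalize("NFD", composed)
print([f"U+{ord(ch):04X}" for ch in decomposed])

# In ISO-8859-1 the composed character is the single byte 0xE9.
latin1_bytes = composed.encode("iso-8859-1")
print(latin1_bytes)

# Canonical composition (NFC) restores the single code point.
recomposed = unicodedata.normalize("NFC", decomposed)
```

The decomposed string cannot be encoded as ISO-8859-1 at all, because U+0301 has no counterpart there, which is why a converter has to compose before it maps characters.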

    At one level, all of these conversions are the same since they can all be treated as mappings of one set of byte strings to another. However, there is a conceptual difference among them that, with some fuzzy edges, seems to correspond to the functionality of the software designed to handle them.

    Returning to practicalities, Uni2ascii and Recode also provide different approaches to and degrees of control over disparities between character sets, e.g. what to do with characters with diacritics when converting to ASCII.

    12 Jan 2006 02:24 ed_avis

    Recode
    How does this compare to GNU recode?
