Postscript was invented by Adobe and released way back in 1985. It is a printed page description language, and most of Adobe’s Postscript income came from license fees paid by the OEMs who embedded it in their printers. For years the exceptional quality of Postscript was only available in these consequently expensive printers, keeping them out of the hands of home users. In many ways, Adobe’s Postscript licensing fees delayed the onset of desktop publishing by almost a decade. This left home users with various proprietary character-coded control sequences, depending on which printer they owned. The Epson and C. Itoh escape codes, and later HP’s PCL, provided at least some commonality, but what we really wanted was Postscript.
Fast forward to 1993: with Postscript licensing fees yet to drop to home-market levels, Adobe announces the “Portable Document Format” (PDF), or what might more honestly be called the “Postscript Document Format”. PDF is an extended, superset version of Postscript that adds support for screen display and contextual cues for text manipulation. It also provides compression, a necessity given Postscript’s verbose textual syntax and the fairly low bandwidth of typical Internet connections at the time. Newer versions of PDF have moved further toward screen-based hypermedia with a facility called Tagged PDF, a method for embedding HTML-like structure in a PDF and letting the PDF reader make all the rendering decisions. Somewhat ironically, that is the opposite direction to the one taken by web browsers, which have been trying to get away from generic rendering virtually since they first appeared.
PDF uses two encoding formats for text: the built-in PDFDocEncoding, and Unicode encoded as UTF-16BE. UTF-16BE is the easiest, assuming you have an incoming Unicode encoding, because it is mostly a direct serial encoding of the 16-bit Unicode values. But if you haven’t got a UTF-16BE stream available, then PDFDocEncoding is what you’ll be using.
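To make “direct serial encoding” concrete, here is a minimal Python sketch of how a Unicode PDF text string is laid out: the FE FF byte order marker followed by the UTF-16BE code units. The helper name is mine, not from any PDF library.

```python
def pdf_utf16be_text_string(text: str) -> bytes:
    """Serialize text as a Unicode PDF text string: the FE FF byte
    order marker followed by the UTF-16BE code units."""
    return b"\xfe\xff" + text.encode("utf-16-be")

# U+2014 (em dash) comes out as the two bytes 20 14, after the marker.
print(pdf_utf16be_text_string("em dash \u2014").hex(" "))
# fe ff 00 65 00 6d 00 20 00 64 00 61 00 73 00 68 00 20 20 14
```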
With PDFDocEncoding, you need to understand what character codes you’re going to be using, so you can map them to the appropriate Postscript and font glyphs.
For example, in Code Page 1252 (CP1252), the value hex 97 is an em dash. CP1252 is an informal Microsoft Windows extension of the standard ISO-8859-1 character encoding; the latter doesn’t include en or em dashes, or several other significant typesetting glyphs. The Unicode representation of an em dash is U+2014, but Unicode is just a numbering system, and to represent that number in memory you need to pick an encoding. In UTF-16BE (Big Endian) it maps directly to the two-byte sequence hex 20 14, but in UTF-8 it gets coded as the three-byte sequence hex e2 80 94.
PDF doesn’t provide a way to map UTF-8 byte sequences to glyphs (e.g. e2 80 94 = em dash), so if you’re not using UTF-16BE, you need to convert any UTF-8 encoded text to a single-byte encoding and provide an appropriate glyph mapping table inside the PDF.
| Encoding | Em dash code |
|---|---|
| CP1252 | hex 97 |
| ISO-8859-1 | n/a |
| Unicode | U+2014 |
| UTF-8 | hex e2 80 94 |
| UTF-16BE | hex 20 14 |
| UTF-16LE | hex 14 20 |
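Those values are easy to double-check with Python’s built-in codecs; this is just a sanity check, not part of any PDF workflow:

```python
EM_DASH = "\u2014"  # Unicode code point U+2014

for codec in ("cp1252", "utf-8", "utf-16-be", "utf-16-le"):
    print(f"{codec:>9}: {EM_DASH.encode(codec).hex(' ')}")
# cp1252: 97, utf-8: e2 80 94, utf-16-be: 20 14, utf-16-le: 14 20

# ISO-8859-1 has no em dash at all, so the encode simply fails:
try:
    EM_DASH.encode("iso-8859-1")
except UnicodeEncodeError as exc:
    print("iso-8859-1:", exc.reason)
```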
In a *nix environment, the iconv program (a typically terse UNIX name) will convert between a whole range of encodings, but writing your own converter is also simple for single-byte encodings, such as code pages and other extended-ASCII formats. I’m not going to talk about the specifics of character encoding and Unicode, because there’s enough information out there already; suffice it to say that code pages in various forms have been around since the early 1970s, and were used on most micros and mainframes to represent glyphs on a screen. More recently, Microsoft seems to have owned the idea due to the MS-DOS legacy.
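For the single-byte case, “writing your own” really just means a lookup table. The sketch below goes in the direction that matters here, Unicode text to CP1252 bytes; the table is deliberately cut down to the typesetting characters mentioned above, so treat it as an illustration rather than a complete CP1252 mapping. (On the command line, iconv does the full job with something like `iconv -f UTF-8 -t CP1252`.)

```python
# Partial Unicode -> CP1252 map: ASCII and Latin-1 pass straight through,
# only the 0x80-0x9f "extras" (dashes, curly quotes, etc.) need a table.
CP1252_EXTRAS = {
    "\u2013": 0x96,  # en dash
    "\u2014": 0x97,  # em dash
    "\u2018": 0x91,  # left single quotation mark
    "\u2019": 0x92,  # right single quotation mark
    "\u201c": 0x93,  # left double quotation mark
    "\u201d": 0x94,  # right double quotation mark
}

def to_cp1252(text: str, fallback: int = 0x3F) -> bytes:  # 0x3F is '?'
    out = bytearray()
    for ch in text:
        if ch in CP1252_EXTRAS:
            out.append(CP1252_EXTRAS[ch])
        elif ord(ch) < 0x100:        # ASCII / Latin-1 range maps directly
            out.append(ord(ch))
        else:
            out.append(fallback)     # anything else is lossy
    return bytes(out)

print(to_cp1252("caf\u00e9 \u2014 d\u00e9j\u00e0 vu").hex(" "))
# 63 61 66 e9 20 97 20 64 e9 6a e0 20 76 75
```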
At Synop, our Sytadel platform uses HTMLDOC to generate PDFs from microcontent encoded as UTF-8, but HTMLDOC doesn’t yet support UTF-16BE, so we pre-convert our incoming UTF-8 data to CP1252, before passing it to HTMLDOC to remap inside the PDF. The roundtrip is interesting, because the data was originally stored in a Microsoft Access database, and was converted from CP1252 to UTF-8 when it was exported for use in Sytadel.
Conversion between character encodings tends to be lossy, depending on the repertoire of each encoding, but given the scope of Unicode, and our originating data being encoded in the fairly limited CP1252, this isn’t a problem we’ve really had to deal with.
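In case it’s useful, here is roughly what that pre-conversion step looks like. The function and file names are illustrative, not the actual Sytadel code, and the replacement policy simply makes any loss explicit rather than failing mid-conversion.

```python
def utf8_to_cp1252(utf8_bytes: bytes) -> bytes:
    """Decode UTF-8 input and re-encode it as CP1252 for HTMLDOC.

    Anything with no CP1252 equivalent is replaced with '?', so the
    (rare, for our data) lossy cases are visible rather than fatal.
    """
    return utf8_bytes.decode("utf-8").encode("cp1252", errors="replace")

# Illustrative file names; the CP1252 output is what gets handed to HTMLDOC.
with open("microcontent.html", "rb") as src, open("cp1252.html", "wb") as dst:
    dst.write(utf8_to_cp1252(src.read()))
```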
Character set encodings can be tricky beasts if you don’t know how they work, as most also embed the common 7-bit ASCII encoding, which can hide some of the more obscure character encoding problems. But if you know the background, character set encodings are, like most technologies, quite simple to understand and master.