Posts about codec

Converting latin-1 To utf-8 with Python

Tonight I finally converted all the Glossary pages in my mirror of the Jargon File into Unicode (utf-8 encoding) so that they will transmit and display properly from GitHub Pages (or any other modern web server). It was a fairly trivial thing to do in the end, but I am likely to need to repeat this for other things at work, so I'm blogging it.

The Jargon File was converted into XML-Dockbook and Unicode for version 4.4.0, but ESR only converted the front- and back-matter, not the Glossary entries (i.e. the actual lexicon). Those are still latin-1 (ISO-8859-1). And although the HTML rendition begins with the correct header declaring this:

<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>

The pages are actually served from catb.org as Unicode (utf-8). For instance, compare /dev/null on catb.org with my mirror of /dev/null.

Read more…