Converting latin-1 To utf-8 with Python

Tonight I finally converted all the Glossary pages in my mirror of the Jargon File into Unicode (utf-8 encoding) so that they will transmit and display properly from GitHub Pages (or any other modern web server). It was a fairly trivial thing to do in the end, but I am likely to need to repeat this for other things at work, so I'm blogging it.

The Jargon File was converted into XML-Dockbook and Unicode for version 4.4.0, but ESR only converted the front- and back-matter, not the Glossary entries (i.e. the actual lexicon). Those are still latin-1 (ISO-8859-1). And although the HTML rendition begins with the correct header declaring this:

<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>

The pages are actually served from catb.org as Unicode (utf-8). For instance, compare /dev/null on catb.org with my mirror of /dev/null.


UTF-8 does not equal Latin-1

It's a widely held misconception that utf-8 is a superset of ISO-8859-1. It's not. While Unicode itself does contain a Latin-1 Supplement, all of the upper 128 character bit-patterns of ISO-8859-1 have different meaning in utf-8, and there are many legitimate ISO-8859-1 characters that are illegal utf-8 encoding byte sequences. This often confuses me because I'm used to 8-bit characters where the encoding is the same as the code-point: A is 0x41 and means A; ÿ is 0xFF (in latin-1) and means ÿ. With Unicode, the code-points and their encodings are not the same, there are multiple byte-stream encodings for Unicode, and utf-8 is just one of them.

Despite the 8 in it's name, utf-8 is not 8-bit. It's a potentially-multi-byte encoding for Unicode, and it's at least 8 bits. It just happens to share half of the 8-bit space with ISO-8859-1 (which shares the same half with ASCII), and so for English the same characters have the same 8-bit encodings. Most Western-European languages it can be encoded in utf-8 with between 8 and 16-bits as well, and some require 21-bits (so 3 bytes or 4 bytes for utf-8's codec scheme, but only for rare characters).

This is why utf-8 is so popular online:

  • most web sites are in Western-European languages
  • most of these Latin-based characters can be encoded with 8 bits per character, many with up to 16-bits
  • the first 128 characters are the same bit-patterns as 7-bit ASCII, the original byte stream for the APRANET and the Internet

If you look at Japanese or Chinese sites, they prefer utf-16 because this encoding is more efficient for those Unicode blocks: the most common Chinese characters will fit in a 2-byte encoding in utf-16, whereas the same code-points in utf-8 typically need 3 or 4 bytes.

Why ISO-8859-1 “breaks” in utf-8

Because utf-8 is a multi-byte encoding, it reserves some bit patterns for encoding that more than one byte is involved in the current code-point for a character. The last code-point in 8-bit utf-8 is actually 0x7F: the most-significant-bit is reserved to indicate multi-byte. So all of the legitimate ISO-8859-1 8-bit bytes result in different Unicode characters, and some are illegal utf-8 byte sequences.

That's why you can't just take an ISO-8859-1 byte stream (or any other ISO-8859 code page, or the Windows one) and interpret it as utf-8. It only works if the bytes from the stream are in the 7-bit (ASCII) range.

Fixing the Jargon File

When I mirrored the Jargon File back in October 2015, I asked HTTrack to encode as UTF-8, thinking this would fix it. It didn't really succeed, all it did was to add its own encoding header:

<!-- Mirrored from www.catb.org/~esr/jargon/html/A/ABEND.html by HTTrack Website Copier/3.x [XR&CO'2014], Mon, 26 Oct 2015 13:15:16 GMT -->
<!-- Added by HTTrack --><meta http-equiv="content-type" content="text/html;charset=utf-8" /><!-- /Added by HTTrack -->

The bytes within the page were still ISO-8859-1. So I was having the same issue as catb.org: while the page declares ISO, the web server actually sends utf-8; and now the HTTrack insertion adds its own content to the confusion.

Fixing the Jargon file properly will involve actually going into the Docbook sources and ESRs makefiles and correcting it there. I'm not going to do that: it's too much effort to recover software that understands DocBook XML 1.62 still and can do a lossless conversion. That'll be a "someday" project probably (it's been 15 years since ESR updated the Jargon File himself, and it's dated, and bordering on becoming bogus because of a lack of currency, so it's a low-priority maybe project). Instead what I've done is just run a filter over all the HTML output of the Jargon, since that is what is actually being served most of the time.

Here's the python code, transcode.py:

#!/usr/bin/env python3
import sys
import os
import codecs

sourceFileName = sys.argv[1]
targetFileName = sys.argv[1]+'.tmp'

BLOCKSIZE = 1048576 # or some other, desired size in bytes
with codecs.open(sourceFileName, "r", "ISO-8859-1") as sourceFile:
    with codecs.open(targetFileName, "w", "utf-8") as targetFile:
        while True:
            contents = sourceFile.read(BLOCKSIZE)
            if not contents:
                break
            targetFile.write(contents.replace('ISO-8859-1','utf-8'))

os.rename(targetFileName,sourceFileName)

It's a clone and hack, based upon this answer to an SO question on converting files to utf-8 in Python. All I changed was:

  • take the sourceFileName from the script argument (without checking if there is an argument)
  • set a temporary targetFileName based on the source file
  • hardcoded the source encoding
  • and replaced "ISO-8859-1" in the output file with "utf-8" so that the headers match the content (very näive: just looks for the exact string match, so any mention of "ISO-8859-1" within the body will also be replaced)
  • Finally, replace the original source file with the temporary target file, by intentionally clobbering the original with os.rename().

I then ran this code in a shell loop like so:

[src][[email protected]:~/hax/blog/milosophical.me/files/jargon/html]
[22:56](nikola)β for Y in 0 [A-Z]; do
  echo $Y
  for X in $Y/*.html; do
    echo $X
    ~/hax/transcode.py $X
  done
done

And there we have it: all the Jargon lexicon is now encoded as utf-8 and declared as such in the headers. Here's /dev/null:

<?xml version="1.0" encoding="utf-8" standalone="no"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<!-- Mirrored from www.catb.org/~esr/jargon/html/0/dev-null.html by HTTrack Website Copier/3.x [XR&CO'2014], Mon, 26 Oct 2015 13:16:51 GMT -->
<!-- Added by HTTrack --><meta http-equiv="content-type" content="text/html;charset=utf-8" /><!-- /Added by HTTrack -->
<head><title>/dev/null</title><link rel="stylesheet" href="../../jargon.css" type="text/css"/><meta name="generator" content="DocBook XSL Stylesheets V1.61.0"/><link rel="home" href="../index.html" title="The Jargon File"/><link rel="up" href="../0.html" title="0"/><link rel="previous" href="TM.html" title="(TM)"/><link rel="next" href="me.html" title="/me"/></head><body><div class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="3" align="center">/dev/null</th></tr><tr><td width="20%" align="left"><a accesskey="p" href="TM.html">Prev</a> </td><th width="60%" align="center">0</th><td width="20%" align="right"> <a accesskey="n" href="me.html">Next</a></td></tr></table><hr/></div><dt><a id="dev-null"/><dt xmlns="" id="dev-null"><b>/dev/null</b>: <span xmlns="http://www.w3.org/1999/xhtml" class="pronunciation">/dev·nuhl/</span>, <span xmlns="http://www.w3.org/1999/xhtml" class="grammar">n.</span></dt></dt><dd><p> [from the Unix null device, used as a data sink] A notional
   &#8216;black hole&#8217; in any information space being discussed, used, or
   referred to.  A controversial posting, for example, might end &#8220;<span class="quote">Kudos
   to [email protected], flames to /dev/null</span>&#8221;.  See 
   <a href="../B/bit-bucket.html"><i class="glossterm">bit bucket</i></a>.</p></dd><div class="navfooter"><hr/><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="TM.html">Prev</a> </td><td width="20%" align="center"><a accesskey="u" href="../0.html">Up</a></td><td width="40%" align="right"> <a accesskey="n" href="me.html">Next</a></td></tr><tr><td width="40%" align="left" valign="top">(TM) </td><td width="20%" align="center"><a accesskey="h" href="../index.html">Home</a></td><td width="40%" align="right" valign="top"> /me</td></tr></table></div></body>
<!-- Mirrored from www.catb.org/~esr/jargon/html/0/dev-null.html by HTTrack Website Copier/3.x [XR&CO'2014], Mon, 26 Oct 2015 13:16:51 GMT -->
</html>

Happy Hacking.