Tonight I finally converted all the Glossary pages in my mirror of the [Jargon File](http://catb.org/~esr/jargon) into Unicode (utf-8 encoding) so that they will transmit and display properly from GitHub Pages (or any other modern web server). It was a fairly trivial thing to do in the end, but I am likely to need to repeat this for other things at work, so I'm blogging it.

The Jargon File was [converted into XML-DocBook and Unicode](/jargon/news.html) for version 4.4.0, but ESR only converted the front- and back-matter, not the Glossary entries (i.e. the actual lexicon). Those are still *latin-1* ([ISO-8859-1](https://en.wikipedia.org/wiki/ISO/IEC_8859-1)). And although the HTML rendition begins with the correct header declaring this:

```html
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
```

the pages are actually served from catb.org as [Unicode](https://en.wikipedia.org/wiki/Unicode) ([utf-8](https://en.wikipedia.org/wiki/UTF-8)). For instance, compare [/dev/null on catb.org](http://www.catb.org/~esr/jargon/html/0/dev-null.html) with [my mirror of /dev/null](/jargon/html/0/dev-null.html).

----

# UTF-8 does not equal Latin-1

It's a widely held misconception that utf-8 is a superset of ISO-8859-1. It's not. While *Unicode* itself *does* contain a [Latin-1 Supplement](https://en.wikipedia.org/wiki/Latin-1_Supplement_(Unicode_block)), *all* of the upper 128 character bit-patterns of ISO-8859-1 have a *different meaning in utf-8*, and there are many legitimate ISO-8859-1 characters that are *illegal utf-8 byte sequences*.

This often confuses me because I'm used to 8-bit characters where the encoding is the same as the code-point: `A` is `0x41` and means *A*; `ÿ` is `0xFF` (in latin-1) and means *ÿ*. With Unicode, the code-points and their encodings are *not* the same: there are multiple byte-stream encodings for Unicode, and utf-8 is just one of them.

Despite the *8* in its name, utf-8 is not 8-bit. It's a *potentially-multi-byte* encoding for Unicode, and it's *at least 8 bits* per character. It just happens to share *half* of the 8-bit space with ISO-8859-1 (which shares the same half with ASCII), and so for English the same characters have the same 8-bit encodings. *Most* Western-European languages can be encoded in utf-8 with between 8 and 16 bits per character as well, and some characters need code-points of up to 21 bits (3 or 4 bytes in utf-8's encoding scheme, but only for rare characters). This is why utf-8 is so popular online:

* most web sites are in Western-European languages
* most of these Latin-based characters can be encoded with 8 bits per character, many with up to 16 bits
* the first 128 characters are the same bit-patterns as 7-bit [ASCII](https://en.wikipedia.org/wiki/ASCII), the original byte stream of the ARPANET and the Internet

If you look at Japanese or Chinese sites, they prefer [utf-16](https://en.wikipedia.org/wiki/UTF-16) because that encoding is more efficient for those Unicode blocks: the most common Chinese characters fit in a 2-byte encoding in utf-16, whereas the same code-points in utf-8 typically need 3 or 4 bytes.

## Why ISO-8859-1 “breaks” in utf-8

Because utf-8 is a multi-byte encoding, it reserves some bit patterns to signal that more than one byte makes up the current code-point. The last code-point that utf-8 encodes in a single byte is actually `0x7F`: a set most-significant bit marks a byte as part of a multi-byte sequence. So all of the legitimate ISO-8859-1 8-bit bytes mean something *different* in utf-8, and some are *illegal utf-8 byte sequences*.
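A couple of lines at a Python prompt make the point concrete (an illustrative sketch, nothing to do with the conversion below):

```python
# 0xFF is a perfectly good ISO-8859-1 byte: it means 'ÿ' (code-point U+00FF)
assert b'\xff'.decode('ISO-8859-1') == 'ÿ'

# ...but as a standalone byte it is an illegal utf-8 sequence
try:
    b'\xff'.decode('utf-8')
except UnicodeDecodeError as err:
    print(err)  # 'utf-8' codec can't decode byte 0xff in position 0 ...

# utf-8 spends *two* bytes on the same character, because the
# code-point (U+00FF) and its encoding (0xC3 0xBF) are different things
assert 'ÿ'.encode('utf-8') == b'\xc3\xbf'
```

The same goes for every byte above `0x7F`: decoded as latin-1 it's a character in its own right; decoded as utf-8 it's either one piece of a multi-byte sequence or an error.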
That's why you can't just take an ISO-8859-1 byte stream (or any other ISO-8859 code page, or the Windows one) and interpret it as utf-8. It only works if the bytes from the stream are in the 7-bit (ASCII) range.

# Fixing the Jargon File

When I mirrored the Jargon File back in October 2015, I asked [HTTrack](https://www.httrack.com) to encode as UTF-8, thinking this would fix it. It didn't really succeed: all it did was add its own encoding header:

```html
<meta http-equiv="content-type" content="text/html;charset=utf-8" />
```

The bytes within the page were still ISO-8859-1. So I was having the same issue as catb.org: while the page declares ISO, the web server actually sends utf-8; and now the HTTrack insertion adds its own `content` to the confusion.

Fixing the Jargon File [properly](/jargon/mirroring.html) will involve actually going into the [DocBook sources](/jargon/jargsrc.tar.gz) and ESR's makefiles and correcting it there. I'm not going to do that: it's too much effort to recover software that still understands [DocBook](http://docbook.sourceforge.net/) XML [1.62](https://sourceforge.net/projects/docbook/files/OldFiles/) and can do a lossless conversion. That'll be a "someday" project, probably (it's been **15 years** since ESR updated the Jargon File himself, and it's dated, bordering on becoming [bogus](/jargon/jargtxt.html) for lack of currency, so it's a low-priority *maybe* project).

Instead what I've done is just run a filter over all the HTML output of the Jargon File, since that is what is actually being served most of the time. Here's the python code, `transcode.py`:

```python
#!/usr/bin/env python3
import sys
import os
import codecs

sourceFileName = sys.argv[1]
targetFileName = sys.argv[1] + '.tmp'
BLOCKSIZE = 1048576  # or some other, desired size in bytes

# Read as latin-1, write as utf-8, fixing the charset declaration on the way
with codecs.open(sourceFileName, "r", "ISO-8859-1") as sourceFile:
    with codecs.open(targetFileName, "w", "utf-8") as targetFile:
        while True:
            contents = sourceFile.read(BLOCKSIZE)
            if not contents:
                break
            targetFile.write(contents.replace('ISO-8859-1', 'utf-8'))

os.rename(targetFileName, sourceFileName)
```

It's a [clone and hack](/jargon/html/C/clone-and-hack-coding.html), based upon [this answer to an SO question](https://stackoverflow.com/a/191403/776953) on converting files to utf-8 in Python. All I changed was:

* take the `sourceFileName` from the script argument (without checking if there *is* an argument)
* set a temporary `targetFileName` based on the source file
* [hardcoded](/jargon/html/H/hardcoded.html) the source encoding
* replaced "`ISO-8859-1`" in the output file with "`utf-8`" so that the headers match the content (very naïve: it just looks for the exact string match, so any mention of "ISO-8859-1" within the body will also be replaced)
* finally, replaced the original source file with the temporary target file, by intentionally [clobber](http://milosophical.me/jargon/html/C/clobber.html)ing the original with `os.rename()`

I then ran this code in a shell loop like so:

```sh
[src][mjl@milo:~/hax/blog/milosophical.me/files/jargon/html] [22:56](nikola)β for Y in 0 [A-Z]; do
    echo $Y
    for X in $Y/*.html; do
        echo $X
        ~/hax/transcode.py $X
    done
done
```

And there we have it: all the Jargon lexicon is now encoded as utf-8 and declared as such in the headers. Here's [/dev/null](/jargon/html/0/dev-null.html):

```html
/dev/null
/dev/null: /dev·nuhl/, n.

[from the Unix null device, used as a data sink] A notional ‘black hole’ in any information space being discussed, used, or referred to. A controversial posting, for example, might end “Kudos to rasputin@kremlin.org, flames to /dev/null”. See bit bucket.

```

Happy [Hacking](/jargon/html/H/hack.html).
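P.S. If you ever need to double-check a conversion like this, a short walk over the tree will flag any stragglers. This is a throw-away sketch I'm adding for illustration (the script and its name weren't part of the job above):

```python
#!/usr/bin/env python3
# verify-utf8.py (hypothetical helper): report any .html file under the
# given directory that does not decode as strict utf-8.
import sys
from pathlib import Path

bad = 0
for page in Path(sys.argv[1]).rglob('*.html'):
    try:
        page.read_text(encoding='utf-8')  # strict decoding by default
    except UnicodeDecodeError as err:
        bad += 1
        print(f'{page}: {err}')

print(f'{bad} file(s) still not utf-8')
```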