Posts about character-encoding

Control Characters

Have you ever wondered about the Unix terminal control characters: Ctl-C to interrupt a program, or Ctl-D to close the terminal? Do you know what Ctl-L does? Why does your computer beep if you press Ctl-G, but not other control keys?

The answer is that by holding down the Control key and typing a letter, you're sending a “control character” to the computer.

But what are these control characters, and why is there a seemingly random association between letters of the alphabet and functions of the terminal?
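
As a hint at why the association is less random than it looks, here is a minimal sketch assuming the traditional ASCII terminal convention, in which Control keeps only the low five bits of the letter's code. The helper name `control_char` is made up for this illustration.

```python
def control_char(letter: str) -> int:
    """Return the ASCII code sent by Control plus the given letter.

    Assumes the traditional ASCII convention; `control_char` is a
    hypothetical helper, not part of any library.
    """
    # Keeping only the low five bits maps 'A'..'Z' (65..90) and
    # 'a'..'z' (97..122) onto the control codes 1..26.
    return ord(letter) & 0x1F


for key in "CDGL":
    print(f"Ctl-{key} sends ASCII code {control_char(key)}")
```

Under that assumption Ctl-G maps to code 7, which ASCII names BEL (bell), and that is why the terminal beeps.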

Converting latin-1 To utf-8 with Python

Tonight I finally converted all the Glossary pages in my mirror of the Jargon File into Unicode (utf-8 encoding) so that they will transmit and display properly from GitHub Pages (or any other modern web server). It was a fairly trivial thing to do in the end, but I am likely to need to repeat this for other things at work, so I'm blogging it.

The Jargon File was converted into XML-DocBook and Unicode for version 4.4.0, but ESR only converted the front- and back-matter, not the Glossary entries (i.e. the actual lexicon). Those are still latin-1 (ISO-8859-1). Although the HTML rendition begins with the correct header declaring this:

<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>

the pages are actually served from catb.org as Unicode (utf-8). For instance, compare /dev/null on catb.org with my mirror of /dev/null.
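
A conversion like this can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not necessarily the script used for the mirror: the helper name `latin1_to_utf8` is hypothetical, and rewriting the `encoding` declaration is an assumed extra step to keep the header consistent with the new bytes.

```python
from pathlib import Path


def latin1_to_utf8(path: Path) -> None:
    """Re-encode one file in place from latin-1 to utf-8.

    Hypothetical helper for illustration only.
    """
    # Every byte value is valid latin-1, so this decode cannot fail.
    text = path.read_bytes().decode("latin-1")
    # Assumption: the XML declaration should match the new encoding.
    text = text.replace('encoding="ISO-8859-1"', 'encoding="UTF-8"', 1)
    # Working with bytes avoids any newline translation.
    path.write_bytes(text.encode("utf-8"))
```

Because latin-1 decoding never raises an error, running this twice on the same file silently double-encodes it, so it is worth working on a copy or checking each file's declaration first.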
