Control Characters

Have you ever wondered about the Unix terminal control characters: Ctl-C to interrupt a program, or Ctl-D to close the terminal? Do you know what Ctl-L does? Why does your computer beep if you press Ctl-G, but not other control keys?

The answer is that by holding down the Control key and typing a letter, you're sending a “control character” to the computer.

But what are these control characters, and why is there a seemingly random association between letters of the alphabet, and functions of the terminal?

First, the quick answer: remember that everything inside a computer is stored as numbers, even letters. When you type a letter with the Control key held down, the numeric encoding of that (capital) letter first has 64 (0x40) taken away from it, and the result is what gets sent. Why this value? It's the most significant bit (MSB), the 7th bit, of the US-ASCII teletype character code table.
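As a rough sketch of that arithmetic in Python (the function name `control_code` is made up for illustration, and treating lowercase input like its capital is an assumption about how terminals usually behave):

```python
CONTROL_OFFSET = 0x40  # 64: the 7th bit of a 7-bit ASCII code


def control_code(letter: str) -> int:
    """Return the code sent for Ctl-<letter>, using the subtraction described above.

    Assumption: lowercase input is treated like its capital, since terminals
    generally send the same code for Ctl-c and Ctl-C.
    """
    code = ord(letter.upper())
    # Only codes 0x40..0x5F (@, A-Z, [ \ ] ^ _) have a control counterpart.
    if not 0x40 <= code <= 0x5F:
        raise ValueError(f"no control character for {letter!r}")
    return code - CONTROL_OFFSET


if __name__ == "__main__":
    for key in "CDGL":
        print(f"Ctl-{key}: 0x{ord(key):02X} - 0x40 = 0x{control_code(key):02X}")
```

This only models the number that is sent; what the terminal or program does with it is a separate matter.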

Looking at an old 7-bit ASCII table, we can see what this will do, but the arrangement of the table is important for making this clear. Usually the table is laid out as a flat list, which doesn't expose what I'm about to point out. Take a look at this table, which is a 16 × 8 grid of code points:


To read this table, use the row and then the column to work out the encoding of each character. For instance, the letter A is located at row 4, column 1, so it is encoded as 0x41 in hexadecimal, or 100 0001 in binary (it's also 65 decimal, though the decimal value is more confusing than helpful).
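If you'd like to generate such a grid yourself, here is a minimal Python sketch. The helper names (`cell`, `code_point`, `position`, `render_table`) are invented for this example, and the plain-text layout is just one possible rendering, not the table from the original page:

```python
# Standard abbreviations for the 32 control characters, codes 0x00..0x1F.
CONTROL_NAMES = [
    "NUL", "SOH", "STX", "ETX", "EOT", "ENQ", "ACK", "BEL",
    "BS", "HT", "LF", "VT", "FF", "CR", "SO", "SI",
    "DLE", "DC1", "DC2", "DC3", "DC4", "NAK", "SYN", "ETB",
    "CAN", "EM", "SUB", "ESC", "FS", "GS", "RS", "US",
]


def cell(code: int) -> str:
    """Printable label for one 7-bit code point."""
    if code < 0x20:
        return CONTROL_NAMES[code]
    if code == 0x20:
        return "SP"
    if code == 0x7F:
        return "DEL"
    return chr(code)


def code_point(row: int, column: int) -> int:
    """Row is the high 3 bits, column is the low 4 bits."""
    return row * 16 + column


def position(char: str) -> tuple:
    """Inverse of code_point: (row, column) of a character."""
    return divmod(ord(char), 16)


def render_table() -> str:
    """Lay out all 128 codes as 8 rows of 16 columns, with a hex column header."""
    header = "    " + " ".join(f"{column:<3X}" for column in range(16))
    lines = [header.rstrip()]
    for row in range(8):
        cells = " ".join(f"{cell(code_point(row, column)):<3}" for column in range(16))
        lines.append(f"{row}   {cells}".rstrip())
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_table())
    print(position("A"), hex(code_point(4, 1)))
```

Because each row holds 16 codes, the row number is simply the first hexadecimal digit of the code and the column is the second.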

Some facts about character encodings in ASCII begin to make some sense when the table is arranged this way:

  • The first two rows, 0x20 (32) characters in all, are used to encode terminal control commands, sent in-band within the terminal data stream
  • There is a simple relation between the number-characters and their character encodings: set the three highest bits of the character encoding to 000 (by subtracting 0x30) and you have their numeric value, in binary, from the column number. For example ASCII character 1 (0x31) is the numeric value 0x31 - 0x30 = 0x01 = 1, ASCII 2's value is 0x02, and so on
  • Similarly, each letter's position in the alphabet can be calculated by subtracting 0x40 from the ASCII code: A = 0x41 and is the 0x41-0x40 = 0x1st letter of the alphabet; P = 0x50 and is 0x50 - 0x40 = 0x10th (that is, 16th), and so on
  • There is a simple relation between a capital letter and its minuscule pair: A is two rows above a in the table, which is a difference of 0x20. This means that implementing a Shift key on a terminal could be just a matter of clearing the 6th bit, for instance shifting from 110 0001 for a to 100 0001 for A
  • Extending this, you can see a simple relation between the letters on rows 4 and 5 and the controls on rows 0 and 1: you can produce the control characters by clearing the 7th bit with the Control key, which is like pressing Shift to clear the 6th bit
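The relations in the list above can be sketched as a few bit operations in Python. All of the function names here are hypothetical, and bits are counted from 1 at the least-significant end to match the "6th bit" and "7th bit" wording used in this post:

```python
def clear_bit(code: int, n: int) -> int:
    """Clear the n-th bit of code, counting from 1 at the least-significant end."""
    return code & ~(1 << (n - 1))


def digit_value(digit: str) -> int:
    """Numeric value of "0".."9": keep only the low 4 bits (same as subtracting 0x30)."""
    return ord(digit) & 0x0F


def alphabet_position(letter: str) -> int:
    """Position of a letter in the alphabet: A is 1, P is 16, Z is 26."""
    return ord(letter.upper()) - 0x40


def shift(letter: str) -> str:
    """Model of a Shift key for letters only: clear the 6th bit (0x20)."""
    return chr(clear_bit(ord(letter), 6))


def control(letter: str) -> int:
    """Model of a Control key for letters only: shift, then clear the 7th bit (0x40)."""
    return clear_bit(ord(shift(letter)), 7)


if __name__ == "__main__":
    print(digit_value("1"), alphabet_position("P"), shift("a"), control("a"))
```

Note that `shift` and `control` are only meaningful for letters here; a real keyboard handles digits and punctuation with its own rules.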

The ASCII code, despite its limitations so far as language support goes, is a very elegant code, with a lot of clever features, a bit like the periodic table of the elements.

Control characters

So, let's look briefly at the control characters I teased in the introduction.

  • Pressing Ctl-C sends 000 0011 instead of 100 0011. This is the encoding for ETX (End of Text) and marks the end of a record, or acts as a Break or Interrupt signal
  • Pressing Ctl-D sends 000 0100 which is EOT (End of Transmission), and ends the communication on a terminal, and so also hangs up the session
  • Pressing Ctl-L sends 000 1100 which is FF (Form Feed). On an old teletype paper terminal, this feeds the paper through to the next page. A video terminal clears the screen
  • And the Ctl-G? That sends 000 0111 which is BEL, to ring the bell on an old teletype. Modern terminals, lacking a bell, make a beep instead
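To double-check the bit patterns quoted in the list above, here is a small Python sketch. The `NAMES` dictionary only covers the four codes discussed here, and `seven_bit` and `describe` are names invented for this example:

```python
# Only the four control codes discussed above; not a complete table.
NAMES = {0x03: "ETX", 0x04: "EOT", 0x07: "BEL", 0x0C: "FF"}


def seven_bit(code: int) -> str:
    """Format a 7-bit code as 'ddd dddd', the grouping used in this post."""
    bits = f"{code:07b}"
    return f"{bits[:3]} {bits[3:]}"


def describe(key: str) -> str:
    """Show what Ctl-<key> sends next to the plain capital letter's code."""
    plain = ord(key)
    sent = plain - 0x40
    name = NAMES.get(sent, "?")
    return f"Ctl-{key} sends {seven_bit(sent)} instead of {seven_bit(plain)}: {name}"


if __name__ == "__main__":
    for key in "CDLG":
        print(describe(key))
```

Python's own string escapes line up with two of these: "\a" is the BEL character (code 7) and "\f" is FF (code 12).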

Now it is also clear why viewing a Windows or DOS text file on a Unix computer (or adding it to git without normalising line-endings) will show ^M on the end of each line: the CR (Carriage Return) is unprintable, so instead the terminal displays a caret ^ and then the character that results from flipping the 7th bit back on (000 1101 becomes 100 1101, which is M). Viewing a Unix text file on Windows looks strange as well, because these files only contain the LF (Line Feed) characters: Windows is technically correct not to automatically return the virtual carriage since there is no CR.
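That caret display can be sketched in a few lines of Python. This is an illustration in the spirit of what terminals and pagers show, not the code of any particular tool; `caret` and `visible` are made-up names, and the input is assumed to be 7-bit ASCII:

```python
def caret(code: int) -> str:
    """Render one 7-bit code, using ^X notation for unprintable characters."""
    if code < 0x20 or code == 0x7F:
        # Flip the 7th bit: 000 1101 (CR) becomes 100 1101 (M).
        return "^" + chr(code ^ 0x40)
    return chr(code)


def visible(data: bytes) -> str:
    """Make control characters visible, but let LF still end the line.

    Assumption: data holds only 7-bit ASCII; bytes of 0x80 and above are
    not handled meaningfully by this sketch.
    """
    return "".join("\n" if byte == 0x0A else caret(byte) for byte in data)


if __name__ == "__main__":
    dos_text = b"first line\r\nsecond line\r\n"
    print(visible(dos_text), end="")
```

The same flip explains other sightings: BS (000 1000) shows up as ^H and HT (000 1001) as ^I.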

Photo of an amber DEC VT-320 video terminal

I recall using old Digital VT-320 video terminals at the UTas computer lab back in 1993. Misconfigured terminals would often show ^H when you pressed the Backspace key, and the ASCII table shows why: Backspace sends BS (000 1000), which is Ctl-H. If your Tab key ever stopped working, you could try pressing Ctl-I instead, rather than use the space bar to indent: it usually worked in the pine editor at least 😀

Isn't this rather outdated today?

Yes, it is, sort of. Today computers almost always use Unicode, often encoded in UTF-8, to represent the code points of characters for all of the world's written languages. Many computers also have a lot of the glyphs needed to display non-European letters.

But Unicode, and the UTF-8 encoding, deliberately preserve the 128 characters of 7-bit US-ASCII as their first 128 code points, so control characters are still valid, even on modern computers, though Unix and Windows may interpret the controls slightly differently, and many of the codes are replaced by out-of-band mechanisms in lower network transport layers.
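A quick way to see this compatibility for yourself is the following Python sketch (the helper names `ascii_survives_utf8` and `utf8_length` are invented for this example):

```python
ASCII_BYTES = bytes(range(128))  # every 7-bit code, control characters included


def ascii_survives_utf8() -> bool:
    """True if every ASCII character encodes to the same single byte in UTF-8."""
    text = ASCII_BYTES.decode("ascii")
    return text.encode("utf-8") == ASCII_BYTES


def utf8_length(char: str) -> int:
    """Number of bytes UTF-8 uses for one character."""
    return len(char.encode("utf-8"))


if __name__ == "__main__":
    print(ascii_survives_utf8())
    for char in ("A", "\a", "é", "€", "😀"):
        print(repr(char), utf8_length(char))
```

ASCII characters, BEL included, stay at one byte each, while characters outside ASCII take two, three, or four bytes, so one byte is no longer guaranteed to be one character.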

The notion of in-band control characters dates back at least as far as teletype machines — in fact, it goes back all the way to telegraphy and Morse code, with words like STOP. Even punctuation in written language is a kind of control character, and in-band control is still carried forward today on the Web, with HTML tags and HTTP headers.

It's important these days to keep in mind that the idea of 8 bits = one character = one plain text code point is very obsolete. I've pointed it out before, and it's very well explained by this 2011/2015 Kunststube post, which featured on Hacker News yesterday and prompted me to reminisce about the old days.

I hope you enjoyed this little journey through the ASCII code table.

Happy hacking!