Control Characters
Have you ever wondered about the Unix terminal control characters: Ctl-C to interrupt a program, or Ctl-D to close the terminal? Do you know what Ctl-L does? Why does your computer beep if you press Ctl-G, but not other control keys?
The answer is that by holding down the Control key and typing a letter, you're sending a “control character” to the computer.
But what are these control characters, and why is there a seemingly random association between letters of the alphabet and functions of the terminal?
First, the quick answer: remember that everything inside a computer is stored as numbers, even letters. When you press a key with the Control key held down, the numeric encoding of the letter you typed first has 64 (0x40) subtracted from it, and the result is what gets sent. Why this value? It's the most significant bit (MSB), the 7th bit, of the 7-bit US-ASCII teletype character code.
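As a rough sketch of that arithmetic in Python (the function name ctrl is made up for illustration, and a real terminal driver may well do this differently, for instance by masking bits rather than subtracting):

```python
def ctrl(key: str) -> int:
    """Return the code sent when `key` is typed with Control held down.

    Hypothetical helper: it only handles the characters on rows 4 and 5
    of the ASCII table (0x40 to 0x5F), after upper-casing letters.
    """
    code = ord(key.upper())
    if not 0x40 <= code <= 0x5F:
        raise ValueError(f"no control character for {key!r}")
    return code - 0x40  # take away 64, the 7th bit


print(ctrl("C"))  # 3
print(ctrl("G"))  # 7
```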
Looking at an old 7-bit ASCII table, we can see what this will do, but the arrangement of the table is important for making this clear. Usually the table is laid out as a single list, which hides the pattern I'm about to point out. Take a look at this table, which is a 16 × 8 grid of code points:
To read this table, use the row and then the column to work out the encoding of each character. For instance, the letter A is located at row 4, column 1, so it is encoded as 0x41 in hexadecimal, or 100 0001 in binary (it's also 65 decimal, though the decimal value is more confusing than helpful).
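If you want to generate a grid like this yourself, here is a minimal Python sketch. The names CONTROL_NAMES, label and ascii_grid are my own for this example; the control mnemonics are the standard ASCII abbreviations.

```python
# Standard mnemonics for the 32 control codes on rows 0 and 1.
CONTROL_NAMES = [
    "NUL", "SOH", "STX", "ETX", "EOT", "ENQ", "ACK", "BEL",
    "BS", "HT", "LF", "VT", "FF", "CR", "SO", "SI",
    "DLE", "DC1", "DC2", "DC3", "DC4", "NAK", "SYN", "ETB",
    "CAN", "EM", "SUB", "ESC", "FS", "GS", "RS", "US",
]


def label(code: int) -> str:
    """Printable label for a 7-bit code point."""
    if code < 0x20:
        return CONTROL_NAMES[code]
    if code == 0x20:
        return "SP"
    if code == 0x7F:
        return "DEL"
    return chr(code)


def ascii_grid() -> str:
    """Build the table as text: 8 rows (high 3 bits) by 16 columns (low 4 bits)."""
    lines = ["    " + " ".join(f"{col:>3X}" for col in range(16))]
    for row in range(8):
        cells = " ".join(f"{label(row * 16 + col):>3}" for col in range(16))
        lines.append(f"{row:>3} " + cells)
    return "\n".join(lines)


print(ascii_grid())
```

The row number is the high hex digit of the code and the column is the low hex digit, so row 4, column 1 is 0x41.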
Some facts about character encodings in ASCII begin to make some sense when the table is arranged this way:
- The first two rows, 0x20 (32) characters in all, encode terminal control commands, sent in-line with the terminal data stream
- There is a simple relation between the digit characters and their encodings: set the three highest bits of the encoding to 000 (by subtracting 0x30) and you have the numeric value, in binary, which is also the column number. For example, ASCII character 1 (0x31) has the numeric value 0x31 - 0x30 = 0x01 = 1, ASCII 2's value is 0x02, and so on
- Similarly, each letter's position in the alphabet can be calculated by subtracting 0x40 from its ASCII code: A = 0x41, so it is letter 0x41 - 0x40 = 0x01 of the alphabet; P = 0x50, so it is letter 0x50 - 0x40 = 0x10 (that is, the 16th), and so on
- There is a simple relation between a capital letter and its minuscule pair: A is two rows above a in the table, which is a difference of 0x20. This means that implementing a Shift key on a terminal could be just a matter of clearing the 6th bit, for instance shifting from 110 0001 for a to 100 0001 for A
- Extending this, you can see a simple relation between the letters on rows 4 and 5 and the controls on rows 0 and 1: the Control key produces the control characters by clearing the 7th bit, just as pressing Shift clears the 6th bit
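These relations are easy to check for yourself. Here is a small Python sketch; the helper names (bits, clear_bit, digit_value, alphabet_position) are invented for the example, and bits are counted from 1 at the least-significant end, as in the text above.

```python
def bits(code: int) -> str:
    """Format a 7-bit code as 3 row bits, a space, then 4 column bits."""
    return f"{code >> 4:03b} {code & 0xF:04b}"


def clear_bit(code: int, bit: int) -> int:
    """Clear one bit of `code`, counting bits from 1 at the low end."""
    return code & ~(1 << (bit - 1))


def digit_value(ch: str) -> int:
    """Numeric value of an ASCII digit character."""
    return ord(ch) - 0x30


def alphabet_position(ch: str) -> int:
    """Position of a capital ASCII letter in the alphabet (A is 1)."""
    return ord(ch) - 0x40


print(digit_value("7"))                # 7
print(alphabet_position("P"))          # 16
print(bits(ord("a")))                  # 110 0001
print(bits(clear_bit(ord("a"), 6)))    # 100 0001, which is A ("Shift")
print(bits(clear_bit(ord("C"), 7)))    # 000 0011, which is ETX ("Control")
```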
The ASCII code, despite its limitations so far as language support goes, is a very elegant code, with a lot of clever features, a bit like the periodic table of the elements.
Control characters
So, let's look briefly at the control characters I teased in the introduction.
- Pressing Ctl-C sends 000 0011 instead of 100 0011. This is the encoding for ETX (End of Text), which marks the end of a record, or acts as a Break or Interrupt signal
- Pressing Ctl-D sends 000 0100, which is EOT (End of Transmission); it ends the communication on a terminal, and so also hangs up the session
- Pressing Ctl-L sends 000 1100, which is FF (Form Feed). On an old teletype paper terminal, this feeds the paper through to the next page. A video terminal clears the screen
- And Ctl-G? That sends 000 0111, which is BEL, to ring the bell on an old teletype. Modern terminals, lacking a bell, make a beep instead
Now it is also clear why viewing a Windows or DOS text file on a Unix computer (or adding it to git without normalising line-endings) will show ^M on the end of each line: the CR (Carriage Return) is unprintable, so instead the terminal displays a caret ^ and then the character that results from flipping the 7th bit back on (000 1101 becomes 100 1101, which is M). Viewing a Unix text file on Windows looks strange as well, because these files only contain the LF (Line Feed) characters: Windows is technically correct not to automatically return the virtual carriage since there is no CR.
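Here is a minimal Python sketch of that caret notation, loosely in the spirit of what cat -v prints. The function name caret_notation is made up, the input is assumed to be plain 7-bit ASCII, and I've chosen to leave LF and Tab alone so the output still looks like text.

```python
def caret_notation(data: bytes) -> str:
    """Show control characters as a caret plus the letter 0x40 above them."""
    out = []
    for byte in data:
        if byte in (0x09, 0x0A):       # leave Tab and LF as they are
            out.append(chr(byte))
        elif byte < 0x20:              # rows 0 and 1: set the 7th bit again
            out.append("^" + chr(byte | 0x40))
        elif byte == 0x7F:             # DEL is conventionally shown as ^?
            out.append("^?")
        else:
            out.append(chr(byte))
    return "".join(out)


print(caret_notation(b"a DOS line\r\n"), end="")  # a DOS line^M
```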
I recall using old Digital VT-320 video terminals at the UTas computer lab back in 1993. Misconfigured terminals would often show ^H when you pressed the Backspace key, and the ASCII table shows why: BS (Backspace) is 000 1000, the control counterpart of H. If your Tab key ever stopped working, you could try pressing Ctl-I instead, rather than use the space bar to indent: it usually worked in the pine editor at least 😀
Isn't this rather outdated today?
Yes, it is, sort of. Today computers almost always use Unicode, often encoded in UTF-8, to represent the code points of characters for all of the world's written languages. Most computers also have many of the glyphs needed to display non-European letters.
But Unicode, and the UTF-8 encoding, deliberately preserve the 128 characters of 7-bit US-ASCII, so the control characters are still valid, even on modern computers, though Unix and Windows may interpret the controls slightly differently, and many of the codes have been replaced by out-of-band mechanisms in lower network transport layers.
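You can convince yourself of this compatibility with a few lines of Python. This is just a quick check using the built-in str.encode; the function name ascii_survives_utf8 is my own.

```python
def ascii_survives_utf8() -> bool:
    """True if every 7-bit code point encodes to the same single byte in UTF-8."""
    return all(chr(code).encode("utf-8") == bytes([code]) for code in range(128))


print(ascii_survives_utf8())     # True
print("\x07".encode("utf-8"))    # b'\x07', BEL is still one byte with value 7
print("é".encode("utf-8"))       # b'\xc3\xa9', outside ASCII it takes more bytes
```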
The notion of in-band control characters dates back at least as far as teletype machines; in fact, it goes back all the way to telegraphy and Morse code, with words like STOP. Even punctuation in written language is a kind of control character, and in-band control is still carried forward today on the Web, with HTML tags and HTTP headers.
It's important these days to keep in mind that the idea of 8 bits = one character = one plain text code point is very obsolete. I've pointed it out before, and it's very well explained by this 2011/2015 Kunststube post, which featured on Hacker News yesterday and prompted me to reminisce about the old days.
I hope you enjoyed this little journey through the ASCII code table.
Happy hacking!