Internally, computers represent all data using bits: Each bit is an individual atom of memory that can be either off or on, which we interpret as 0 (off) or 1 (on). In this document, we'll study how computers use bits to represent integers — numbers with no fractional part, like 2, 105, or −38 — and characters — the symbols you can type on a keyboard and that are saved into files.
Before we discuss how computers represent integers, we must first examine our basic numeral system.
You're already familiar with the decimal numeral system. You may remember the following sort of diagram from grade school.
This diagram has a line underneath each digit of the number 1024, and underneath each line is a reminder of how much that place is worth. In representing the number 1024, we have a 4 in the ones place, a 2 in the tens place, a 0 in the hundreds place, and a 1 in the thousands place. This system is also called base 10 because it is based on the number 10: There are 10 possible symbols for each place (0 through 9), and the place values go up by factors of 10 (1, 10, 100, 1000,…).
We call the 10 symbols 0 through 9 digits based on the Latin word for finger, because counting on fingers is the origin of our counting system. Of course, computers aren't endowed with such fingers, and so it shouldn't be surprising that this numeral system is less convenient for computers. Instead, computers count with bits, which have two possible states, and so we use a 2-based system — the binary numeral system.
In the binary numeral system, we have only two symbols 0 and 1, which we call bits based on contracting the phrase binary digit. Also, the place values go up by factors of 2, so our places are worth 1, 2, 4, 8, 16, and so on. The following diagram shows a number written in binary notation.
This value, 1011(2), represents a number with 1 eight, 0 fours, 1 two, and 1 one: We perform the addition 1 ⋅ 8 + 0 ⋅ 4 + 1 ⋅ 2 + 1 ⋅ 1 = 11(10), and we conclude that this binary number 1011(2) is an alternative representation for the number we know as eleven, and which we write as 11 in decimal notation. (The parenthesized subscripts indicate whether the number is in binary notation or decimal notation.)
We'll often want to convert numbers between their binary and decimal representations. With 1011(2), we already saw one example of converting in the direction from binary to decimal. But here's another example: Suppose we want to identify what 100100(2) represents. We first determine what places contain the one bits.
We then add up the values of these places to get a base-10 value: The 32's place and the 4's place are filled with one bits, so we compute 32 + 4 = 36(10).
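This place-value addition is easy to express in code. Here is a small Python sketch of the idea (the function name is ours, chosen for illustration):

```python
# Convert a binary numeral (given as a string of bits) to a decimal
# integer by summing the value of each place that holds a 1.
def binary_to_decimal(bits: str) -> int:
    total = 0
    for position, bit in enumerate(reversed(bits)):
        if bit == "1":
            total += 2 ** position  # this place is worth 2^position
    return total

print(binary_to_decimal("1011"))    # the eleven example: 8 + 2 + 1 = 11
print(binary_to_decimal("100100"))  # 32 + 4 = 36
```

(Python's built-in `int("1011", 2)` performs the same conversion.)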
To convert a number from decimal to binary, we repeatedly determine the largest power of two that fits into the number and subtract it, until we reach zero; the binary representation has a 1 bit in each place whose value we subtracted, and a 0 bit in the remaining places. Suppose, as an example, we want to convert 88(10) to binary. We observe that the largest power of 2 less than 88 is 64, so we decide that the binary expansion of 88 has a 1 in the 64's place, and we subtract 64 to get 88 − 64 = 24. Then we see that the largest power of 2 less than 24 is 16, so we put a 1 in the 16's place and subtract 16 from 24 to get 8. Now 8 is the largest power of 2 that fits into 8, so we put a 1 in the 8's place and subtract to get 0. Once we reach 0, we write down which places we filled with 1's.
We put a zero in each empty place and conclude that the binary representation of 88(10) is 1011000(2).
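The subtract-the-largest-power procedure can be sketched in Python as follows (again, the function name is ours for illustration):

```python
# Greedy decimal-to-binary conversion: repeatedly find the largest
# power of two that fits, record a 1 in that place, and subtract.
def decimal_to_binary(n: int) -> str:
    if n == 0:
        return "0"
    place = 1
    while place * 2 <= n:   # largest power of two not exceeding n
        place *= 2
    bits = ""
    while place >= 1:
        if n >= place:      # this place's value fits into what remains
            bits += "1"
            n -= place
        else:
            bits += "0"
        place //= 2
    return bits

print(decimal_to_binary(88))  # the worked example: 1011000
```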
Modern computers represent all integers using the same amount of space. For example, we might decide that each byte represents a number. (A byte is a group of eight bits.) A byte, however, is very limiting: The largest number we can fit is 11111111(2) = 255(10), and we often want to deal with larger numbers than that.
Thus, computers tend to use groups of bytes called words. Different computers have different word sizes. In the past, many machines had 16-bit words; today, most machines use 32-bit words, though many use 64-bit words. (The term word comes from the fact that four bytes (32 bits) is equivalent to four ASCII characters, and four letters is the length of many useful English words.) Thirty-two bits is plenty for most numbers, as it allows us to represent any integer from 0 up to 2³² − 1 = 4,294,967,295. But the limitation is becoming increasingly irritating — primarily because it leads to problems when you have more than 4 gigabytes of memory (4 gigabytes is 2³² bytes) — and so larger systems frequently use 64-bit words.
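The ranges quoted above follow directly from the bit counts; a quick, purely illustrative Python check:

```python
# The largest unsigned integer an n-bit word can hold is 2^n - 1,
# since the all-ones bit pattern fills every place.
for bits in (8, 16, 32, 64):
    print(f"{bits:2}-bit word: 0 to {2**bits - 1:,}")
```

For a byte this prints 255, and for a 32-bit word it prints 4,294,967,295, matching the limits discussed above.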
Written text is among the most important data processed by a computer, and it too merits some study.
Early computers did not have a standard way of encoding characters into binary, which soon proved unsatisfactory once data began being transferred between computers. So in the early 1960's, an American standards organization now known as ANSI took up the charge of designing a standard encoding. They named their encoding the American Standard Code for Information Interchange, though this name was soon forgotten as people called it by its acronym, ASCII, and it became the baseline standard that systems could depend on for compatibility.
ASCII uses seven bits to encode each character, allowing for 2⁷ = 128 different encodings. These cover essentially every symbol you can find on an English keyboard, plus a few control characters reserved for encoding non-printable information, such as an instruction to ignore the previous character. Most of these control characters are now obsolete; those still in common use include the following.
0x00: NUL — used to terminate strings in some systems
0x08: BS — sent by the backspace key to remove the character before the cursor
0x09: HT — sent by the tab key
0x0A: LF — used to separate lines in a file
0x0D: CR — often required to precede LF for legacy reasons
0x1B: ESC — sent by the escape key
Representing line breaks is somewhat interesting: Early systems based on typewriters performed two actions when the user completed a line. The machine would first do a “carriage return” to move the paper or print head so that the next typed character would land at the left side of the page, followed by a “line feed” to scroll the paper a bit so that the next typed character would go on the following line. ASCII has two separate characters representing these two actions, named CR and LF, and files would be stored with both characters for each line break. The carriage return was sent first because moving all the way across the page horizontally took longer than scrolling the paper to the next line, so you would want to start the horizontal movement first.
In order to preserve compatibility with older systems, many later systems copied this same convention of breaking lines using a CR character followed by an LF character. And indeed many systems today still use this convention, including Microsoft Windows and most Internet protocols. However, other systems were designed to use just one character to save space; this group includes Unix and its descendants Mac OS X and Linux, which use the LF character alone to separate lines.
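The difference between the two conventions is easy to see in code. A Python sketch (the variable names are ours):

```python
# The same two-line text under each line-break convention.
windows_text = "first line\r\nsecond line\r\n"  # CR followed by LF
unix_text = "first line\nsecond line\n"         # LF alone

# splitlines() recognizes both conventions, so the logical content of
# the two strings is identical even though the stored bytes differ.
print(windows_text.splitlines())  # ['first line', 'second line']
print(unix_text.splitlines())     # ['first line', 'second line']
print(len(windows_text), len(unix_text))  # the CR LF version is longer
```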
Beyond the control characters are the printable characters, which are represented as given in the table below. As an example of how to read this table, look at the capital A in the fifth row. This row is labeled 1000xxx, and it is in column 001; putting the row and column together to form the binary code 1000 001, this entry says that A is represented in ASCII with the code 1000001.
† space character (typed using the space bar)
‡ DEL control character (rarely used)
As you can see, ASCII places the digits in sequential order, followed by the capital letters in sequential order followed by the lower-case letters. Punctuation marks are interspersed in between so that the digits and letters start toward the beginning of their respective rows.
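We can confirm these facts with Python's built-in ord function, which returns a character's code (strictly, its Unicode code point, but for these characters that coincides with ASCII):

```python
# 'A' sits at binary 1000001, i.e. 65 in decimal.
print(ord("A"), format(ord("A"), "07b"))  # 65 1000001

# Digits, capitals, and lower-case letters each occupy a
# sequential run of codes, so consecutive characters differ by 1.
print(ord("B") - ord("A"))  # 1
print(ord("b") - ord("a"))  # 1
print(ord("1") - ord("0"))  # 1
```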
Since modern computers use eight bits in each byte, a very natural way to use ASCII is to allocate one byte for each character. Of course, that leaves an extra bit, which could be used for a variety of purposes. During transmission of information, some systems have used this additional bit as a parity bit, set to signify whether an odd number of the other bits are 1, in the hope of detecting the occasional mistransmitted bit. Some systems have used this additional bit to represent whether the text should be highlighted (probably by using inverted text). But the most common technique has been to define it so that the additional 128 bit sequences represent other characters not included in ASCII.
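The parity scheme mentioned above can be sketched as follows; this is an illustrative Python version (the function name is ours), not any particular system's implementation:

```python
# Even parity: set the eighth bit so that the byte as a whole has an
# even number of 1 bits; a receiver can then detect a single flipped bit.
def add_parity_bit(code: int) -> int:
    ones = bin(code).count("1")     # 1 bits in the 7-bit ASCII code
    parity = ones % 2               # 1 if that count is odd
    return (parity << 7) | code     # put the parity bit in the top place

print(bin(add_parity_bit(0b1000001)))  # 'A' has two 1 bits: parity bit is 0
print(bin(add_parity_bit(0b1000011)))  # 'C' has three: parity bit becomes 1
```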
The most common such extension in use is the Latin1 encoding, which adds the accent marks and a few other non-Latin characters necessary for supporting a wide variety of European languages like Spanish and German, as well as additional punctuation including currency symbols beyond the dollar $.
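Python's codecs make it easy to see Latin1 at work; a small sketch (the sample word is our own choice):

```python
# Latin1 (ISO 8859-1) keeps ASCII's codes for the first 128 characters
# and assigns codes 128-255 to accented letters and extra punctuation,
# so every character still fits in a single byte.
text = "Füße"                      # German text with non-ASCII letters
encoded = text.encode("latin-1")
print(list(encoded))               # one byte per character
print(encoded.decode("latin-1"))   # round-trips back to the same text
```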
But there are many others, supporting different alphabets such as Cyrillic (Russian), Hebrew, and Greek. In fact, two standards organizations, ISO and IEC, collaborated on a set of 15 such standards, of which Latin1 is the first; collectively, they are called ISO/IEC 8859.
Having many alternative encoding standards is confusing, and in any case it doesn't address East Asian writing systems, such as those for Chinese and Japanese, which can use tens of thousands of symbols. For this reason, a group got together to define a 16-bit encoding, which they called Unicode. Sixteen bits allows for 65,536 different characters, which covers nearly all characters in modern use. They published their standard in 1991–92.
While this was nearly sufficient, the designers eventually decided that they needed more room, so they extended the encoding to allow about 1.1 million characters. This whole space is unlikely ever to be exhausted.
Today's Unicode standard includes all the common alphabets as well as mathematical symbols, many alphabets only of historic interest (e.g., Egyptian hieroglyphs), and other occasionally useful symbols, like the queen of spades (🂭, if your browser's font supports it). However, enumerating all Chinese characters is basically impossible, so it inevitably omits several.
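Python works directly with Unicode code points, so we can poke at this range ourselves (the particular characters chosen here are just examples):

```python
import sys

# ord and chr map between characters and Unicode code points,
# which run from 0 up to 0x10FFFF (about 1.1 million).
print(hex(ord("A")))       # ASCII survives as the first 128 code points
print(hex(ord("你")))      # a Chinese character, well beyond one byte
print(chr(0x1F0AD))        # the queen of spades mentioned above
print(hex(sys.maxunicode)) # 0x10ffff, the top of the range
```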
While ASCII and its 8-bit extensions were the dominant encoding through the 1990's, most systems today are migrating toward using Unicode. This is a complex process, both because Unicode is itself complex and because so much software has been written on the basis that each character is exactly one byte long.