Internally, computers represent all data using **bits**:
Each bit is an individual atom of memory that can be
either off or on, which we interpret as 0 (off) or 1 (on).
In this document, we'll study how computers use bits to
represent **integers** — numbers with no fractional
part, like 2, 105, or −38 — and **characters**
— the symbols you can type on a keyboard and that are
saved into files.

Before we discuss how computers represent integers, we must first examine our basic numeral system.

You're already familiar with the **decimal numeral
system**.
You may remember the following sort of diagram from grade
school.

| 1    | 0   | 2  | 4 |
|------|-----|----|---|
| 1000 | 100 | 10 | 1 |

This diagram has a line underneath each digit of the number
1024, and underneath each line is a reminder of how much that
place is worth. In representing the number 1024,
we have a 4 in the ones place, a 2 in the tens place,
a 0 in the hundreds places, and a 1 in the thousands place.
This system is also called **base 10** because it
is based on the number 10: There are 10 possible symbols for each place
(0 through 9), and the place values go up by factors of 10
(1, 10, 100, 1000,…).
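Reading the diagram as arithmetic, we compute
1 ⋅ 1000 + 0 ⋅ 100 + 2 ⋅ 10 + 4 ⋅ 1 = 1024.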

We call the 10 symbols 0 through 9 *digits*
based on the Latin word for *finger*,
because counting on fingers is the origin of our counting system.
Of course, computers aren't endowed with such fingers,
and so it shouldn't be surprising that this numeral system is
less convenient for computers.
Instead, computers count with bits, which have two possible states,
and so we use a 2-based system — the **binary numeral
system**.

In the binary numeral system, we have only two symbols 0 and 1,
which we call *bits* based on contracting
the phrase *binary digit*.
Also, the place values go up by factors of 2,
so our places are worth 1, 2, 4, 8, 16, and so on.
The following diagram shows a number written in binary notation.

| 1 | 0 | 1 | 1 |
|---|---|---|---|
| 8 | 4 | 2 | 1 |

This value, 1011_{(2)}, represents a number with
1 eight, 0 fours, 1 two, and 1 one:
We perform the addition
1 ⋅ 8 + 0 ⋅ 4 + 1 ⋅ 2 + 1 ⋅ 1 = 11_{(10)},
and we conclude that this binary number 1011_{(2)}
is an alternative representation for the number we know as
eleven, and which we write as 11 in decimal notation.
(The parenthesized subscripts indicate whether the
number is in binary notation or decimal notation.)

We'll often want to convert numbers between their binary and decimal
representations. With 1011_{(2)}, we already saw one example of
converting in the direction from binary to decimal.
But here's another example: Suppose we want to
identify what 100100_{(2)} represents.
We first determine what places contain the one bits.

| 1  | 0  | 0 | 1 | 0 | 0 |
|----|----|---|---|---|---|
| 32 | 16 | 8 | 4 | 2 | 1 |

We then add up the values of these places to get a base-10
value: The 32's place and the 4's place are filled with one
bits,
so we compute 32 + 4 = 36_{(10)}.
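This place-value addition is mechanical enough that we can write it as a short program. Below is a minimal Python sketch of the binary-to-decimal conversion just described; the function name `binary_to_decimal` is our own choice for illustration.

```python
def binary_to_decimal(bits: str) -> int:
    """Convert a string of bits, such as "100100", to its decimal value."""
    total = 0
    place_value = 1                 # the rightmost place is worth 1
    for bit in reversed(bits):      # walk from the ones place leftward
        if bit == '1':
            total += place_value    # add the value of each filled place
        place_value *= 2            # each place is worth twice the last
    return total

print(binary_to_decimal("100100"))  # prints 36
print(binary_to_decimal("1011"))    # prints 11
```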

To convert a number from decimal to binary,
we repeatedly determine the largest power of two
that fits into the number
and subtract it,
until we reach zero; the binary representation has a 1 bit in
each place whose value we subtracted, and a 0 bit in the
remaining places. Suppose, as an example,
we want to convert 88_{(10)} to binary.
We observe the largest power of 2 less than 88 is 64, so we decide that
the binary expansion of 88 has a 1 in the 64's place, and we subtract
64 to get 88 − 64 = 24. Then we see that the largest power of 2 less than 24 is
16, so we decide to put a 1 in the 16's place and subtract 16 from 24 to
get 8. Now 8 is the largest power of 2 that fits into 8, so we put
a 1 in the 8's place and subtract to get 0.
Once we reach 0, we write down which places we filled with
1's.

| 1  |    | 1  | 1 |   |   |   |
|----|----|----|---|---|---|---|
| 64 | 32 | 16 | 8 | 4 | 2 | 1 |

We put a zero in each empty place and conclude that the binary
representation of 88_{(10)} is 1011000_{(2)}.
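The opposite conversion follows the subtraction procedure above step for step. Here is a minimal Python sketch, again with an illustrative function name of our own (`decimal_to_binary`).

```python
def decimal_to_binary(n: int) -> str:
    """Convert a positive decimal integer to a binary string by
    repeatedly subtracting the largest power of two that fits."""
    power = 1
    while power * 2 <= n:      # find the largest power of two that fits
        power *= 2
    bits = ""
    while power >= 1:
        if power <= n:         # this place gets a 1; subtract its value
            bits += "1"
            n -= power
        else:                  # this place gets a 0
            bits += "0"
        power //= 2
    return bits

print(decimal_to_binary(88))   # prints 1011000
```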

Modern computers represent all integers using the same amount of space.
For example, we might decide that each byte represents a number.
(A **byte** is a group of eight bits.)
A byte, however, is very limiting: The largest number we can
fit is 11111111_{(2)} = 255_{(10)}, and we often want to deal with larger
numbers than that.

Thus, computers tend to use groups of bytes called **words**.
Different computers have different word sizes.
In the past, many machines had 16-bit words;
today, most machines use 32-bit words,
though many use 64-bit words.
(The term *word* comes from the fact that four bytes (32 bits) is
equivalent to four ASCII characters, and four letters is the length of
many useful English words.)
Thirty-two bits is plenty for most numbers, as it allows us to represent
any integer from 0 up to 2^{32} − 1 = 4,294,967,295.
But the limitation is becoming increasingly irritating —
primarily because it leads to problems when you have more than
4 gigabytes of memory (4 gigabytes is 2^{32} bytes) —
and so larger systems frequently use 64-bit words.
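The pattern behind these limits is the same at every word size: a *w*-bit word can hold any integer from 0 up to 2^{w} − 1. A quick Python check, included only as an illustration:

```python
# An unsigned w-bit word holds the integers 0 through 2**w - 1.
for word_size in (8, 16, 32, 64):
    largest = 2 ** word_size - 1
    print(f"{word_size} bits: 0 to {largest:,}")

# 8 bits: 0 to 255
# 16 bits: 0 to 65,535
# 32 bits: 0 to 4,294,967,295
# 64 bits: 0 to 18,446,744,073,709,551,615
```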

Written text is among the most important data processed by a computer, and it too merits some study.

Early computers did not have a standard way of encoding characters
into binary, which proved unsatisfactory once data began being transferred
between computers. So in the early 1960's, an American standards organization
now known as ANSI took up the charge of designing a standard encoding.
They named their encoding the American Standard Code for Information
Interchange, though this full name was soon forgotten as people called it
by its acronym, **ASCII**, and it became the baseline standard that systems
could depend on for compatibility.

ASCII uses seven bits to encode each character, allowing for
2^{7} = 128 different encodings. These cover every symbol that you can find on an English keyboard, plus a few **control characters** reserved for encoding non-printable information, like an instruction to ignore the previous character. Most of the control characters are obsolete today; those still in common use include the following.

| Code | Name | Meaning |
|------|------|---------|
| 0x00 | NUL | used to terminate strings in some systems |
| 0x08 | BS  | sent by the backspace key to remove the character before the cursor |
| 0x09 | HT  | sent by the tab key |
| 0x0A | LF  | used to separate lines in a file |
| 0x0D | CR  | often required to precede LF for legacy reasons |
| 0x1B | ESC | sent by the escape key |
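Most programming languages build escape sequences for these characters into their string syntax. The following Python lines, included only as an illustration, confirm the codes in the table above.

```python
# Each escape sequence corresponds to an ASCII control character.
print(hex(ord('\0')))    # 0x0  (NUL)
print(hex(ord('\b')))    # 0x8  (BS)
print(hex(ord('\t')))    # 0x9  (HT)
print(hex(ord('\n')))    # 0xa  (LF)
print(hex(ord('\r')))    # 0xd  (CR)
print(hex(ord('\x1b')))  # 0x1b (ESC)
```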

Representing line breaks is somewhat interesting. Early systems were based on typewriters, which performed two actions when the user completed a line: first a “carriage return” moved the paper or print head so that the next typed character would land at the left side of the page, and then a “line feed” scrolled the paper a bit so that the next typed character would land on the following line. ASCII has two separate characters representing these two actions, named CR and LF, and files would be stored with both characters for each line break. The carriage return was sent first because moving all the way across the page horizontally took longer than scrolling the paper to the next line, so you would want to start the horizontal motion first.

To preserve compatibility with older systems, many systems copied this same convention of breaking lines with a CR character followed by an LF character. Many systems today still use it, including Microsoft Windows and most Internet protocols. Other systems, however, were designed to use just one character to save space; these include Unix and its descendants Mac OS X and Linux, which use the LF character alone to separate lines.
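One practical consequence is that a program reading text files may encounter either convention. As a small sketch, Python's built-in `splitlines` method accepts both, so the two sample strings below (invented for illustration) yield the same lines.

```python
windows_text = "first line\r\nsecond line\r\n"  # CR followed by LF
unix_text = "first line\nsecond line\n"         # LF alone

# splitlines() recognizes both line-break conventions.
print(windows_text.splitlines())  # ['first line', 'second line']
print(unix_text.splitlines())     # ['first line', 'second line']
```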

Beyond the control characters are the printable characters, which
are represented as given in the table below. As an example of
how to read this table, look at the capital *A* in the
fifth row. This row is labeled 1000`xxx`, and it
is in column 001; putting the row and column together to form
the binary code 1000 001, this entry says that *A* is
represented in ASCII with the code 1000001.

|         | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111 |
|---------|-----|-----|-----|-----|-----|-----|-----|-----|
| 0100xxx | †   | !   | "   | #   | $   | %   | &   | '   |
| 0101xxx | (   | )   | *   | +   | ,   | -   | .   | /   |
| 0110xxx | 0   | 1   | 2   | 3   | 4   | 5   | 6   | 7   |
| 0111xxx | 8   | 9   | :   | ;   | <   | =   | >   | ?   |
| 1000xxx | @   | A   | B   | C   | D   | E   | F   | G   |
| 1001xxx | H   | I   | J   | K   | L   | M   | N   | O   |
| 1010xxx | P   | Q   | R   | S   | T   | U   | V   | W   |
| 1011xxx | X   | Y   | Z   | [   | \\  | ]   | ^   | _   |
| 1100xxx | \`  | a   | b   | c   | d   | e   | f   | g   |
| 1101xxx | h   | i   | j   | k   | l   | m   | n   | o   |
| 1110xxx | p   | q   | r   | s   | t   | u   | v   | w   |
| 1111xxx | x   | y   | z   | {   | \|  | }   | ~   | ‡   |

† space character (typed using the space bar)

‡ DEL control character (rarely used)

As you can see, ASCII places the digits in sequential order, followed by the capital letters in sequential order, followed by the lower-case letters in sequential order. Punctuation marks are interspersed between these runs so that the digits and letters start toward the beginning of their respective rows.
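This sequential layout is convenient for programs that manipulate characters, since converting between a digit character and its value, or between the two letter cases, reduces to adding or subtracting a fixed offset. A brief Python illustration:

```python
# 'A' sits in row 1000xxx, column 001 of the table: code 1000001.
print(bin(ord('A')))        # prints 0b1000001

# Digits are consecutive, so subtracting the code for '0' gives the value.
print(ord('7') - ord('0'))  # prints 7

# Capital and lower-case letters are consecutive runs exactly 32 apart,
# so changing case is a fixed offset.
print(ord('a') - ord('A'))  # prints 32
print(chr(ord('g') - 32))   # prints G
```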

Since modern computers use eight bits in each byte, a very natural way to use ASCII is to allocate one byte for each character. Of course, that leaves an extra bit, which could be used for a variety of purposes. During transmission of information, some systems have used this additional bit to signify whether an odd number of the other bits are 1, in the hope of identifying the occasional mistransmitted bit. Some systems have used this additional bit to represent whether the text should be highlighted (probably by using inverted text). But the most common technique has been to define it so that the additional 128 bit sequences represent other characters not included in ASCII.
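The parity scheme mentioned above is straightforward to state in code: choose the eighth bit so that the byte contains an even number of 1 bits, and a receiver that sees an odd count knows some bit was flipped in transit. A minimal sketch, with an illustrative function name of our own:

```python
def parity_encode(code: int) -> int:
    """Set the eighth bit of a 7-bit ASCII code so that the whole
    byte has an even number of 1 bits (even parity)."""
    if bin(code).count('1') % 2 == 1:
        return code | 0b10000000   # odd count: set the top bit
    return code                    # even count: leave the top bit 0

print(bin(parity_encode(0b1000001)))  # 'A': two 1 bits, prints 0b1000001
print(bin(parity_encode(0b1000011)))  # 'C': three 1 bits, prints 0b11000011
```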

The most common such extension in use is the Latin1 encoding, which adds
accented letters and a few other characters needed to support a wide variety
of European languages such as Spanish and German, as well as additional
punctuation, including currency symbols beyond the dollar *$*.

But there are many others, supporting different alphabets such as Cyrillic (Russian), Hebrew, and Greek. In fact, two organizations collaborated on a set of 15 such standards, of which Latin1 is the first; collectively, they are called ISO/IEC 8859.

Having many alternative encoding standards is confusing, and in any case it doesn't address East Asian writing systems, such as those for Chinese and Japanese, which can use tens of thousands of symbols. For this reason, a group got together to define a 16-bit encoding, which they called Unicode. Sixteen bits allows for 65,536 different characters, which covers nearly all characters in modern use. They published their standard in 1991–92.

While this was nearly sufficient, the designers eventually decided that they needed more room, so they extended the encoding to allow about 1.1 million characters. This whole space is unlikely ever to be exhausted.

Today's Unicode standard includes all the common alphabets as well as mathematical symbols, many alphabets only of historic interest (e.g., Egyptian hieroglyphs), and other occasionally useful symbols, like the queen of spades (🂭, if your browser's font supports it). However, enumerating all Chinese characters is essentially impossible, so the standard inevitably omits some.

While ASCII and its 8-bit extensions were the dominant encoding through the 1990's, most systems today are migrating toward using Unicode. This is a complex process, both because Unicode is itself complex and because so much software has been written on the basis that each character is exactly one byte long.
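To see where the one-byte assumption breaks down, consider UTF-8, one common way of encoding Unicode (the characters below are our own choice of examples): ASCII characters still occupy a single byte, but other characters require two to four bytes each.

```python
# In UTF-8, characters occupy a varying number of bytes.
for ch in ('A', 'é', '€', '🂭'):
    encoded = ch.encode('utf-8')
    print(ch, len(encoded), 'byte(s):', encoded.hex())

# A 1 byte(s): 41
# é 2 byte(s): c3a9
# € 3 byte(s): e282ac
# 🂭 4 byte(s): f09f82ad
```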