Representing data

by Carl Burch, Hendrix College, September 2012

Representing data by Carl Burch is licensed under a Creative Commons Attribution-Share Alike 3.0 United States License.
Based on a work at www.cburch.com/books/bitops/.

Contents

1. Numeral systems
1.1. Binary numbers
1.2. Hexadecimal and octal
2. Integers
2.1. Unsigned integer representation
2.2. Sign-magnitude representation
2.3. Two's-complement representation
2.4. Adding two's-complement numbers
3. Characters
3.1. ASCII
3.2. 8-bit encodings
3.3. Unicode
3.4. Unicode encodings

Internally, computers represent all data using bits: Each bit is an individual atom of memory that can be either off or on, which we interpret as 0 (off) or 1 (on). In this document, we'll study how computers use bits to represent integers — numbers with no fractional part, like 2, 105, or −38 — and we'll learn how a program can manipulate this integer representation directly using bit operators.

1. Numeral systems

Before we discuss how computers represent integers, we must first examine our basic numeral systems.

1.1. Binary numbers

You're already familiar with the decimal numeral system. You may remember the following sort of diagram from grade school.

   1      0      2      4
 ----   ----   ----   ----
 1000    100     10      1

This diagram has a line underneath each digit of the number 1024, and underneath each line is a reminder of how much that place is worth. In representing the number 1024, we have a 4 in the ones place, a 2 in the tens place, a 0 in the hundreds place, and a 1 in the thousands place. This system is also called base 10 because it is based on the number 10: There are 10 possible symbols for each place (0 through 9), and the place values go up by factors of 10 (1, 10, 100, 1000,…).

We call the 10 symbols 0 through 9 digits based on the Latin word for finger, because counting on fingers is the origin of our counting system. Of course, computers aren't endowed with such fingers, and so it shouldn't be surprising that this numeral system is less convenient for computers. Instead, computers count with bits, which have two possible states, and so we use a 2-based system — the binary numeral system.

In the binary numeral system, we have only two symbols 0 and 1, which we call bits based on contracting the phrase binary digit. Also, the place values go up by factors of 2, so our places are worth 1, 2, 4, 8, 16, and so on. The following diagram shows a number written in binary notation.

  1    0    1    1
 ---  ---  ---  ---
  8    4    2    1

This value, 1011(2), represents a number with 1 eight, 0 fours, 1 two, and 1 one: We perform the addition 1 ⋅ 8 + 0 ⋅ 4 + 1 ⋅ 2 + 1 ⋅ 1 = 11(10), and we conclude that this binary number 1011(2) is an alternative representation for the number we know as eleven, and which we write as 11 in decimal notation. (The parenthesized subscripts indicate whether the number is in binary notation or decimal notation.)

We'll often want to convert numbers between their binary and decimal representations. With 1011(2), we already saw one example of converting in the direction from binary to decimal. But here's another example: Suppose we want to identify what 100100(2) represents. We first determine what places contain the one bits.

  1    0    0    1    0    0
 ---  ---  ---  ---  ---  ---
 32   16    8    4    2    1

We then add up the values of these places to get a base-10 value: The 32's place and the 4's place are filled with one bits, so we compute 32 + 4 = 36(10).
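
This place-value arithmetic is easy to automate. Here is a minimal Python sketch (the function name to_decimal is just illustrative) that walks the bits from the 1's place upward, adding in the value of each place that holds a 1.

    def to_decimal(bits):
        """Convert a binary numeral (a string of 0s and 1s) to an int."""
        total = 0
        place = 1                      # the rightmost place is worth 1
        for bit in reversed(bits):     # walk from the 1's place upward
            if bit == '1':
                total += place
            place *= 2                 # each place is worth twice the previous one
        return total

    print(to_decimal('100100'))   # prints 36, matching the hand computation above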

To convert a number from decimal to binary, we repeatedly determine the largest power of two that fits into the number and subtract it, until we reach zero; the binary representation has a 1 bit in each place whose value we subtracted, and a 0 bit in the remaining places. Suppose, as an example, we want to convert 88(10) to binary. We observe that the largest power of 2 less than 88 is 64, so we decide that the binary expansion of 88 has a 1 in the 64's place, and we subtract 64 to get 88 − 64 = 24. Then we see that the largest power of 2 less than 24 is 16, so we decide to put a 1 in the 16's place and subtract 16 from 24 to get 8. Now 8 is the largest power of 2 that fits into 8, so we put a 1 in the 8's place and subtract to get 0. Once we reach 0, we write down which places we filled with 1's.

  1         1    1
 ---  ---  ---  ---  ---  ---  ---
 64   32   16    8    4    2    1

We put a zero in each empty place and conclude that the binary representation of 88(10) is 1011000(2).
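
The repeated-subtraction procedure can be sketched in Python as well; again, the function name is only for illustration.

    def to_binary(n):
        """Convert a nonnegative int to a binary numeral string by repeatedly
        subtracting the largest power of two that still fits."""
        if n == 0:
            return '0'
        power = 1
        while power * 2 <= n:          # find the largest power of 2 that fits
            power *= 2
        bits = ''
        while power >= 1:
            if n >= power:             # this place gets a 1; subtract its value
                bits += '1'
                n -= power
            else:                      # this place gets a 0
                bits += '0'
            power //= 2
        return bits

    print(to_binary(88))   # prints 1011000, as found above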

1.2. Hexadecimal and octal

For humans, reading and typing binary numbers is a major hassle: The numbers tend to be long, and we're better with shorter numbers, even if there are more symbols to choose from. Computer programmers often find base 16 the most convenient choice — called the hexadecimal numeral system, often abbreviated as hex.

One of the first problems we face in talking about hexadecimal is that we need 16 different digit symbols, but we know only the 10 symbols 0 through 9. We fill out the remaining six symbols with letters from the alphabet:

A is 10(10)     D is 13(10)
B is 11(10)     E is 14(10)
C is 12(10)     F is 15(10)

Thus, we can plausibly talk about the number 1B4 in the hexadecimal system. Let's see what 1B4 translates to in decimal: As before, we first insert the place values, which now go up by factors of 16, starting with 1, 16, 256, and 4096.

  1     B     4
 ---   ---   ---
 256    16    1

We see that 1B4(16) has one 256, eleven 16's, and four 1's, so we compute 1 · 256 + 11 · 16 + 4 · 1 = 436(10). We conclude that 1B4(16) = 436(10).

But surely, you might object, decimal would be more convenient than hexadecimal! The special thing about hexadecimal is that it is based on a power of 2: 2⁴ = 16. This leads to a simple technique for converting between hexadecimal and binary, much quicker than anything available for decimal: We simply convert each hexadecimal digit into four bits.

Let's look at an example of this conversion using 1B4(16). We observe that the first digit 1 has the four-bit binary equivalent of 0001, the second digit B has the four-bit binary equivalent of 1011, and the third digit 4 has the four-bit binary equivalent of 0100. Put these together, and we arrive at the binary number 0001 1011 0100(2).

Going the other direction is just as easy. Let's consider the binary number 1011010110(2). We break it up into groups of four, starting from the bottom four bits, arriving at 10 1101 0110; and then we translate each group into the corresponding hexadecimal digit: 10(2) = 2, 1101(2) = 13(10) = D(16), and 0110(2) = 6. We therefore conclude that 1011010110(2) = 2D6(16).
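
Since each hexadecimal digit corresponds to exactly four bits, both directions of this conversion are mechanical. A small Python sketch (with illustrative names) follows; it reproduces the two examples above.

    def hex_to_binary(hex_str):
        """Expand each hex digit into its four-bit binary equivalent."""
        return ' '.join(format(int(digit, 16), '04b') for digit in hex_str)

    def binary_to_hex(bit_str):
        """Group the bits into fours from the right and convert each group."""
        bits = bit_str.replace(' ', '')
        pad = (-len(bits)) % 4                 # pad on the left to a multiple of 4
        bits = '0' * pad + bits
        groups = [bits[i:i+4] for i in range(0, len(bits), 4)]
        return ''.join(format(int(g, 2), 'X') for g in groups)

    print(hex_to_binary('1B4'))         # 0001 1011 0100
    print(binary_to_hex('1011010110'))  # 2D6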

Programmers also occasionally use base 8, called octal. Here, our digits are 0 through 7, and the place values are 1, 8, 64, 512, and so on. Thus 31(8) translates to three 8's and one 1, which is 3 ⋅ 8 + 1 ⋅ 1 = 25(10). (This leads to a simple joke: Why do programmers have a hard time distinguishing Halloween from Christmas? Because OCT 31 = DEC 25.)

Octal has the advantage that you can type it on a numeric keypad, but most programmers prefer hexadecimal — in part because groups of four bits conveniently pair into groups of eight, corresponding to the bytes that a computer uses.

These alternative bases are useful enough that C and its imitators (including Python and Java) have a special notation for writing octal and hexadecimal numbers in programs. Prefixing a number with the digit 0 indicates an octal number: 031 is an equivalent way of writing 25. (Python 3 instead spells the octal prefix 0o, as in 0o31.) Similarly, prefixing a number with the two-character sequence 0x indicates a hexadecimal number: 0x1b4 is equivalent to 436. When you see such a literal in a program, it usually indicates that the programmer views the particular bit pattern as more important than its numeric value.
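
As a concrete illustration, the following Python 3 literals denote the values discussed above; the 0b binary form is also accepted by Java 7 and later and by recent C and C++ standards.

    # The same values written in different bases (Python 3 literals).
    print(0o31)      # octal 31     -> 25   (written 031 in C)
    print(0x1b4)     # hexadecimal 1B4 -> 436
    print(0b1011)    # binary 1011  -> 11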

2. Integers

Now we can look at how computers actually store integers.

2.1. Unsigned integer representation

Modern computers represent all integers using the same amount of space. For example, we might decide that each byte represents a number. (A byte is a group of eight bits.) A byte, however, is very limiting: The largest number we can fit is 11111111(2) = 255(10), and we often want to deal with larger numbers than that.

Thus, computers tend to use groups of bytes called words. Different computers have different word sizes. In the past, many machines had 16-bit words; today, most machines use 32-bit words, though many use 64-bit words. (The term word comes from the fact that four bytes (32 bits) is equivalent to four ASCII characters, and four letters is the length of many useful English words.) Thirty-two bits is plenty for most numbers, as it allows us to represent any integer from 0 up to 2³² − 1 = 4,294,967,295. But the limitation is becoming increasingly irritating — primarily because it leads to problems when you have more than 4 gigabytes of memory (4 gigabytes is 2³² bytes) — and so larger systems frequently use 64-bit words.

At first, that looks simple enough, but there's a minor but significant detail: In storing the 32-bit number among four bytes, should you store the lowest-order 8 bits or the highest-order 8 bits in the first byte (i.e., the byte with the lowest address)? We write numbers with the highest-order digits first, which at first suggests we should start there; but we typically do arithmetic (like addition and multiplication) starting from the lowest-order bits, so arguably it's more intuitive to start with the low end — though we do comparisons, and thus sorting, starting from the highest-order bits, which puts us back where we started. The overall consensus is that neither choice has a decisive advantage, but established systems have been built using each method. Systems that start with the highest-order bits are called big-endian, while ones that start with the lowest-order bits are called little-endian. (This is a reference to the feud between Lilliput and Blefuscu over which end boiled eggs should be broken, as recounted in Gulliver's Travels.) Most Intel-based systems (including nearly all modern PCs) use the little-endian order, while Internet protocols typically specify the big-endian order.
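
You can observe the two byte orders directly with Python's standard struct module, whose '>' and '<' format prefixes request big-endian and little-endian layouts respectively; the value below is chosen only so that all four bytes are distinct.

    import struct

    n = 0x01020304                        # a 32-bit value with four distinct bytes
    print(struct.pack('>I', n).hex())     # big-endian:    01020304
    print(struct.pack('<I', n).hex())     # little-endian: 04030201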

Representing an integer as a plain binary numeral in a fixed number of bits is called an unsigned representation. The term comes from the fact that the only numbers representable in the system have no negative sign.

But what about negative integers? After all, there are some perfectly respectable numbers below 0. We'll examine two techniques for representing integers both negative and positive: sign-magnitude representation and two's-complement representation.

2.2. Sign-magnitude representation

Sign-magnitude representation is the more intuitive technique. Here, we let the first bit indicate whether the number is positive or negative (the number's sign), and the rest of the bits tell how far the number is from 0 (its magnitude). Suppose we are working with 8-bit sign-magnitude numbers.

  10 is represented as 00001010
 −10 is represented as 10001010

For −10(10), we use 1 for the first bit, because the number is negative, and then we place 10(10) = 1010(2) into the remaining seven bits.

What's the range of integers we can represent with an 8-bit sign-magnitude representation? For the largest number, we'd want 0 for the sign bit and 1 everywhere else, giving us 01111111, or 127(10). For the smallest number, we'd want 1 for the sign bit and 1 everywhere else, giving us −127(10). An 8-bit sign-magnitude representation, then, can represent any integer from −127(10) to 127(10).

This range of integers includes 255 values. But we've seen that 8 bits can represent up to 256 different values. The discrepancy arises from the fact that the representation includes two representations of the number zero (0 and −0, represented as 00000000 and 10000000).
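
The following minimal Python sketch (the function name is illustrative) builds the 8-bit sign-magnitude pattern and makes the two representations of zero easy to see.

    def sign_magnitude_8bit(n):
        """Encode an integer from -127 to 127 in 8-bit sign-magnitude form."""
        sign = '1' if n < 0 else '0'          # first bit: the sign
        magnitude = format(abs(n), '07b')     # remaining 7 bits: distance from 0
        return sign + magnitude

    print(sign_magnitude_8bit(10))    # 00001010
    print(sign_magnitude_8bit(-10))   # 10001010
    print(sign_magnitude_8bit(0))     # 00000000
    # The "negative zero" pattern 10000000 decodes to the same value, 0.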

Arithmetic using sign-magnitude representation is somewhat more complicated than we might hope. To build a circuit comparing two numbers, you would need additional circuitry so that −0 is understood as equal to 0. Moreover, adding two numbers requires separate circuitry to handle the cases of when the numbers' signs match and when they don't match. Because of these complications, today's computers don't use sign-magnitude representation for integers. (It is used regularly in floating-point numbers, which are beyond the scope of this document.)

2.3. Two's-complement representation

Computers today use the two's-complement representation for integers. In the two's-complement system, the topmost bit's value is the negation of its meaning in an unsigned system. For example, in an 8-bit unsigned system, the topmost bit is the 128's place.

 
 ---  ---  ---  ---  ---  ---  ---  ---
 128   64   32   16    8    4    2    1

In an 8-bit two's-complement system, then, we negate the meaning of the topmost bit to be −128 instead.

 
 ----  ---  ---  ---  ---  ---  ---  ---
 −128   64   32   16    8    4    2    1

If we were working with 32-bit numbers — as is typical — then we'd still negate the meaning of the topmost bit. In a 32-bit unsigned representation, the topmost bit is worth 2³¹, so in a 32-bit two's-complement representation, the topmost bit is worth −2³¹.

Let's see how a number is represented using two's-complement representation; we'll return to an 8-bit system, since it's easier to work examples there. We'll follow the same algorithm as before for converting from decimal to binary. Let's look at the example of −100(10). We first choose a 1 for the −128's place, leaving us with (−100) − (−128) = 28. (Since the place value is negative, we subtract a negative number.) Then we'd choose a 1 for the 16's place, the 8's place, and the 4's place to reach 0.

   1    0    0    1    1    1    0    0
 ----  ---  ---  ---  ---  ---  ---  ---
 −128   64   32   16    8    4    2    1

Thus, the 8-bit two's-complement representation of −100(10) would be 10011100.
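
A quick way to get the same pattern programmatically is to reduce the value modulo 256 and write the result as an unsigned 8-bit numeral; this works because changing the topmost place from +128 to −128 changes the represented value by exactly 256. A Python sketch, with an illustrative name:

    def twos_complement_8bit(n):
        """Return the 8-bit two's-complement pattern of n (-128 <= n <= 127)."""
        return format(n % 256, '08b')   # n mod 2**8 is the pattern's unsigned value

    print(twos_complement_8bit(-100))  # 10011100
    print(twos_complement_8bit(100))   # 01100100
    print(twos_complement_8bit(-10))   # 11110110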

To see the relationship between positive and negative representations, let's look at representing some numbers and their negations.

  100 is represented as 01100100
 −100 is represented as 10011100
   10 is represented as 00001010
  −10 is represented as 11110110

You see that the negation of a number involves flipping the first several bits of the number, but the last bits remain the same. For 100, we flip all but the last three bits; for 10, only the last two bits remain the same. The bits that remain the same include the last 1 and all succeeding 0's. That's a general rule for negating any two's-complement number: Flip all bits above the lowest 1.
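
We can check this negation rule with a short Python sketch; the helper name is illustrative.

    def negate_8bit(pattern):
        """Negate an 8-bit two's-complement pattern by flipping all bits
        above the lowest 1, as described above."""
        if '1' not in pattern:
            return pattern                      # negating zero gives zero
        low = pattern.rindex('1')               # position of the lowest 1
        flipped = ''.join('1' if b == '0' else '0' for b in pattern[:low])
        return flipped + pattern[low:]          # the lowest 1 and the 0s below it stay

    print(negate_8bit('01100100'))  # 10011100   (100 -> -100)
    print(negate_8bit('00001010'))  # 11110110   (10  -> -10)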

What's the range of numbers representable in an 8-bit two's-complement representation? To arrive at the largest number, we'd want 0 for the −128's bit and 1 everywhere else, giving us 01111111, or 127(10). For the smallest number, we'd want 1 for the −128's bit and 0 everywhere else, giving −128(10). In an 8-bit two's-complement representation, we can represent any integer from −128(10) up to 127(10). This range includes 256 integers; there are no duplicates as with sign-magnitude representation.

It's instructive to map out the bit patterns (in order of their unsigned value) and their corresponding two's-complement values.

bits        value
00000000        0
00000001        1
00000010        2
    :           :
01111110      126
01111111      127
10000000     −128
10000001     −127
    :           :
11111110       −2
11111111       −1

Notice that the two's-complement representation wraps around: If you take the largest number, 01111111, and add 1 to it as if it were an unsigned number, you get 10000000, the smallest number. This wrap-around behavior can lead to some interesting behavior. In one game I played as a child (back when 16-bit computers were popular), the score would go up progressively as you guided a monster through a maze. I wasn't very good at the game, but my little brother mastered it enough that the score would hit its maximum value and then wrap around to a very negative value! Trying to get the largest possible score — without wrapping around — was an interesting challenge.
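
The wrap-around is easy to demonstrate in code: decode each 8-bit pattern as a two's-complement value by subtracting 256 whenever the −128's bit is set. The helper below is only a sketch.

    def decode_8bit(pattern):
        """Interpret an 8-bit pattern as a two's-complement value."""
        unsigned = int(pattern, 2)
        return unsigned - 256 if unsigned >= 128 else unsigned

    print(decode_8bit('01111111'))  # 127, the largest representable value
    print(decode_8bit('10000000'))  # -128: adding 1 to the pattern wrapped around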

2.4. Adding two's-complement numbers

One of the nice things about two's-complement numbers is that you can add them just as you add regular numbers. Suppose we want to add −3 and 5.

  1 1 1 1 1 1 0 1      (−3)
  0 0 0 0 0 1 0 1      (5)

We can attempt to do this using regular addition, akin to the technique we traditionally use in adding base-10 numbers, except that here adding 1 and 1 gives us 0 with a carry of 1, since 1 + 1 = 10(2).

1 1 1 1 1 1   1        (carries)
  1 1 1 1 1 1 0 1      (−3)
  0 0 0 0 0 1 0 1      (5)
-----------------
1 0 0 0 0 0 0 1 0

We get an extra 1 in the ninth bit of the answer, but if we ignore this ninth bit, we get the correct answer of 2(10).

We can reason that ignoring this ninth bit is correct as follows: Say one of the two numbers is negative and the other is positive. That is, one of the two numbers has a 1 in the −128's place, and the other has 0 there. If there is no carry into the −128's place, then the answer is OK, because that means we got the correct sum in the lower 7 bits, and then when we add the −128's place, we'll maintain the −128 represented by the 1 in that location in the negative number.

If there is a carry into the −128's place, then this represents a carry of 128 taken from summing the 64's column. This carry of 128 (represented by a carry of 1), added to the −128 in that column for the negative number (represented by a 1 in that column), should give us 0. This is exactly what we get when we add the carry of 1 into the leftmost column to the negative number's 1 in that column and then throw away the carry out of that column.

A similar sort of analysis will also work when we are adding two negative numbers or two positive numbers. Of course, the addition only works if you end up with something in the representable range. There's no hope of arriving at a good answer if you add 120 and 120 with 8-bit two's-complement numbers, since the number 240 simply isn't representable in that system. (The answer turns out to be −16!)

The upshot of all this is that when building a circuit for two's-complement numbers, we don't need to worry about different cases needing different circuits: The circuit simply needs to do regular unsigned addition and then ignore the final carry (if any).
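
A Python sketch of that circuit-level behavior: add the patterns as unsigned numbers, then keep only the low eight bits.

    def add_8bit(a_pattern, b_pattern):
        """Add two 8-bit two's-complement patterns the way the hardware does:
        unsigned addition, then throw away any ninth carry bit."""
        total = int(a_pattern, 2) + int(b_pattern, 2)
        return format(total % 256, '08b')       # keep only the low 8 bits

    print(add_8bit('11111101', '00000101'))  # 00000010: (-3) + 5 = 2
    print(add_8bit('01111000', '01111000'))  # 11110000: 120 + 120 overflows to -16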

3. Characters

Written text is among the most important data processed by a computer, and it too merits some study.

3.1. ASCII

Early computers did not have a standard way of encoding characters into binary, which soon proved unsatisfactory once data began being transferred between computers. So in the early 1960's, an American standards organization now known as ANSI took up the charge of designing a standard encoding. They named their encoding the American Standard Code for Information Interchange, though this name was soon forgotten as people called it by its acronym, ASCII, and it turned into the basic standard for reliable compatibility.

ASCII uses seven bits to encode each character, allowing for 2⁷ = 128 different encodings. This basically includes every symbol that you can find on an English keyboard, plus a few control characters reserved for encoding non-printable information like an instruction to ignore the previous character. Most of these control characters are basically obsolete; those still in common use include the following.

0x00: NUL — used to terminate strings in some systems
0x08: BS  — sent by the backspace key to remove the character before the cursor
0x09: HT  — sent by the tab key
0x0A: LF  — used to separate lines in a file
0x0D: CR  — often required to precede LF for legacy reasons
0x1B: ESC — sent by the escape key

Representing line breaks is somewhat interesting: Early systems based on typewriters performed two actions when the user completed a line: first a “carriage return” to move the paper or print head so that the next typed character would go at the left side of the page, then a “line feed” to scroll the paper a bit so that the next typed character would go on the following line. ASCII has two separate characters representing these two actions, named CR and LF; files would be stored with both characters for each line break. The carriage return was sent first because moving all the way across the page horizontally took longer than scrolling the paper to the next line, so you would want to start the horizontal motion first.

In order to preserve compatibility with older systems, many systems copied this convention of breaking lines with a CR character followed by an LF character, and many systems today still use it, including Microsoft Windows and most Internet protocols. However, other systems were designed to use just one character to save space; this includes Unix and, from there, Mac OS X and Linux, which use the LF character alone to separate lines.

Beyond the control characters are the printable characters, which are represented as given in the table below. Each row of the table represents ASCII characters that start with the same hexadecimal digit; the column headers indicate the second hexadecimal digit.

        0   1   2   3   4   5   6   7   8   9   A   B   C   D   E   F
0x2?    ␣   !   "   #   $   %   &   '   (   )   *   +   ,   -   .   /
0x3?    0   1   2   3   4   5   6   7   8   9   :   ;   <   =   >   ?
0x4?    @   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O
0x5?    P   Q   R   S   T   U   V   W   X   Y   Z   [   \   ]   ^   _
0x6?    `   a   b   c   d   e   f   g   h   i   j   k   l   m   n   o
0x7?    p   q   r   s   t   u   v   w   x   y   z   {   |   }   ~   DEL

␣ : the space character (typed using the space bar)
DEL : a control character (rarely used)

As you can see, ASCII places the digits in sequential order, followed by the capital letters in sequential order, followed by the lower-case letters. Punctuation marks are interspersed in between so that the digits and letters start near the beginning of their respective rows.
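
This layout is visible from any language that exposes character codes; in Python, the built-in ord and chr functions convert between characters and their numeric codes.

    print(hex(ord('A')))   # 0x41: capital letters start in the 0x4? row
    print(hex(ord('a')))   # 0x61: lower-case letters start two rows later
    print(hex(ord('0')))   # 0x30: digits sit in the 0x3? row, in order
    print(chr(0x7E))       # ~   : the last printable ASCII character before DEL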

3.2. 8-bit encodings

Since modern computers use eight bits in each byte, a very natural way to use ASCII is to allocate one byte for each character. Of course, that leaves one extra bit, which could be used for a variety of purposes. During transmission of information, some systems have used this additional bit as a parity bit, signifying whether an odd number of the other bits are 1, in the hope of detecting the occasional mistransmitted bit. Some systems have used this additional bit to indicate whether the text should be highlighted (probably by inverting it). But the most common technique has been to define the additional 128 bit patterns to represent other characters not included in ASCII.
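
As an illustration of the parity idea (not of any particular system), here is a sketch that sets the eighth bit so that each byte carries an even number of 1 bits; a received byte with an odd count must then contain an error.

    def add_even_parity(seven_bits):
        """Given a 7-bit ASCII code (as an int), set the top bit so that the
        total number of 1 bits in the byte is even."""
        ones = bin(seven_bits).count('1')
        parity = ones % 2                      # 1 if the count is currently odd
        return (parity << 7) | seven_bits

    print(format(add_even_parity(0x41), '08b'))  # 'A' = 1000001 has two 1s  -> 01000001
    print(format(add_even_parity(0x43), '08b'))  # 'C' = 1000011 has three 1s -> 11000011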

The most common such extension is the Latin1 encoding, which adds accented letters and a few other characters needed to support a wide variety of European languages like Spanish and German, as well as additional punctuation, including currency symbols beyond the dollar sign $.

But there are many others, supporting different alphabets such as Cyrillic (Russian), Hebrew, and Greek. In fact, two organizations collaborated on a set of 15 such standards, of which Latin1 is the first; collectively, they are called ISO/IEC 8859.

3.3. Unicode

Having many alternative encoding standards is confusing, and in any case it doesn't address East Asian writing systems, such as those for Chinese and Japanese, which can use tens of thousands of symbols. For this reason, a group got together to define a 16-bit encoding, which they called Unicode. Sixteen bits allows for 65,536 different characters, which covers nearly all characters in modern use. They published their standard in 1991–92.

While this was nearly sufficient, they eventually decided that they needed more room, so they extended the encoding to go up to the hexadecimal number 10FFFF(16) ≈ 1.1 million. This whole space is unlikely ever to be exhausted.

Each number in the space is called a code point and is typically written as a hexadecimal number of four to six digits prefixed by the characters “U+”, as in the example U+03C0. The Unicode standard includes an English name for each character (in the case of U+03C0, it is “Greek small letter pi”) and an illustration of the character (π). An example from the extended space is U+1F0AD, whose English name is “playing card queen of spades”; it appears as 🂭 if your browser's font includes that character.
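
Python makes these code points easy to inspect: the standard unicodedata module reports each character's official name, and chr and ord convert between characters and code points.

    import unicodedata

    pi = chr(0x03C0)                    # the character at code point U+03C0
    print(pi)                           # π
    print(unicodedata.name(pi))         # GREEK SMALL LETTER PI
    print(hex(ord('\N{PLAYING CARD QUEEN OF SPADES}')))  # 0x1f0ad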

Today's Unicode standard includes all the common alphabets as well as mathematical symbols, many alphabets only of historic interest (e.g., Egyptian hieroglyphs), and other occasionally useful symbols. However, enumerating all Chinese characters is basically impossible, so it inevitably omits several.

While ASCII and its 8-bit extensions were the dominant encoding through the 1990's, most systems today are migrating toward using Unicode. This is a complex process, both because Unicode is itself complex and because so much software has been written on the basis that each character is exactly one byte long.

3.4. Unicode encodings

Unicode becomes more complex when we talk about how the characters are encoded into binary. In the initial standard, it wasn't so difficult: All characters fit into 16 bits, so the designers simply suggested that a computer reserve 16 bits for each character. This is called the UCS-2 encoding, which some systems still use, though the extension of Unicode beyond 16 bits renders it obsolete.

UTF-16

The natural extension to UCS-2 is UTF-16. In UTF-16, characters whose code points fit into 16 bits are simply written as is. For the characters beyond U+FFFF (which include the rarer Chinese characters, alphabets of primarily historical interest, and unusual symbols like playing cards), we subtract 10000(16) from the code point to arrive at a 20-bit number and divide it into two groups of 10 bits. We add the upper 10-bit group to D800(16) and encode the sum as a 16-bit number (which will be between D800(16) and DBFF(16)); then we follow that with the 16-bit result of adding DC00(16) to the lower 10-bit group (this sum will be between DC00(16) and DFFF(16)).

This system works because the standard has never defined the code points between U+D800 and U+DFFF, so a number between those two code points indicates that we are looking at an extended-Unicode character. Moreover, we can tell whether we're looking at the upper or lower 16 bits by seeing whether the code point is above or below DC00(16).
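
Here is a sketch of that surrogate-pair arithmetic in Python; the function name is illustrative, and Python's own 'utf-16-be' codec produces the same 16-bit units.

    def utf16_encode(code_point):
        """Encode one code point as one or two 16-bit units (UTF-16)."""
        if code_point <= 0xFFFF:
            return [code_point]                   # fits in a single 16-bit unit
        value = code_point - 0x10000              # a 20-bit number
        high = 0xD800 + (value >> 10)             # upper 10 bits -> D800..DBFF
        low = 0xDC00 + (value & 0x3FF)            # lower 10 bits -> DC00..DFFF
        return [high, low]

    print([hex(u) for u in utf16_encode(0x03C0)])   # ['0x3c0']
    print([hex(u) for u in utf16_encode(0x1F0AD)])  # ['0xd83c', '0xdcad']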

UTF-8

UTF-8 is an alternative encoding scheme that has the advantage of being backward compatible with ASCII-encoded files. It divides the code points into four ranges, handled as follows (a short encoding sketch appears after the list).

U+0000 to U+007F: 0xxxxxxx
Simply represent these with a single byte. Characters in this range match the ASCII standard, and so a file stored entirely using ASCII would read exactly the same in UTF-8.
U+0080 to U+07FF: 110xxxxx 10xxxxxx
These code points fit into 11 bits and are encoded using two bytes. The first byte included should start with the bits 110 followed by the code point's upper five bits, and it should be followed by a byte starting with 10 followed by the code point's lower six bits.
U+0800 to U+FFFF: 1110xxxx 10xxxxxx 10xxxxxx
These code points fit into 16 bits and are encoded using three bytes. The first encoded byte should start with the bits 1110 followed by the code point's upper four bits, followed by a byte starting with 10 followed by the code's next six bits, followed by a byte starting with 10 followed by the code's lower six bits.
U+10000 and up: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
These code points fit into 21 bits and are encoded using four bytes. The first encoded byte should start with the bits 11110 followed by the code point's upper three bits, followed by a byte starting with 10 followed by the code's next six bits, followed by another byte starting with 10 followed by the code's next six bits, followed by a byte starting with 10 followed by the code's lower six bits.
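
As a check on these rules, here is a sketch of the encoder for a single code point; Python's built-in str.encode('utf-8') performs the same computation.

    def utf8_encode(code_point):
        """Encode a single code point into UTF-8 bytes, following the table above."""
        if code_point <= 0x7F:
            return bytes([code_point])                           # 0xxxxxxx
        if code_point <= 0x7FF:
            return bytes([0xC0 | (code_point >> 6),              # 110xxxxx
                          0x80 | (code_point & 0x3F)])           # 10xxxxxx
        if code_point <= 0xFFFF:
            return bytes([0xE0 | (code_point >> 12),             # 1110xxxx
                          0x80 | ((code_point >> 6) & 0x3F),
                          0x80 | (code_point & 0x3F)])
        return bytes([0xF0 | (code_point >> 18),                 # 11110xxx
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])

    print(utf8_encode(0x03C0).hex())   # cf80     (same as 'π'.encode('utf-8').hex())
    print(utf8_encode(0x1F0AD).hex())  # f09f82ad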

Because of its basic compatibility with ASCII, and because it generally stores information in compact form, UTF-8 is used heavily for communication between computers and for disk storage. It is very difficult to use for representing text in memory, though, where programs often want to process text as a random-access array of characters. Providing random-access character access is most convenient using a sequence of 32-bit values (UTF-32), though UTF-16 more often offers the preferred tradeoff between performance and space utilization.