CSci 150: Foundations of computer science

Sorting

Sorting a list is one of the most fundamental problems in computer science. The basic problem is this: Suppose we have a list of numbers.

12 1 13 5 2

We want to reorder the list so that the numbers appear in increasing order.

1 2 5 12 13

Sorting is fundamental to computer science for several reasons: First, real programs often need to sort data; second, sorting techniques prove to be a foundation for many other computational problems; and finally, the problem is relatively simple to analyze, but still complex enough to have some worthwhile intricacies. Thus, it is a perfect showcase problem for study and research.

1. Selection sort

There are many reasonable ways to approach the problem of sorting a list. Most of the obvious ones are also relatively easy to analyze. Take, for example, the selection sort algorithm. In this approach, we find the smallest element and swap it into the first position.

12 1 13 5 2  →  1 12 13 5 2

Then we find the smallest element right of the first position and swap it into the second position.

1 12 13 5 2  →  1 2 13 5 12

Then we find the smallest element right of the second position and swap it into the third position.

1 2 13 5 12  →  1 2 5 13 12

We continue doing this, each time determining the proper number to place in the next position of the list, until we've completed the list.

1 2 5 13 12  →  1 2 5 12 13

Writing this in a program is relatively straightforward.

def selection_sort(data):
    for i in range(len(data) - 1):
        minval = data[i]               # smallest value seen so far
        minloc = i                     # and the index where it sits

        for j in range(i + 1, len(data)):
            if data[j] < minval:
                minval = data[j]
                minloc = j

        data[minloc] = data[i]         # swap the smallest element
        data[i] = minval               #   into position i

We use a variable i to track which position we are currently trying to swap into. To determine where we should swap from, we use a variable j that steps through every index above i, and every time we find a smaller element, we update minval and minloc to record that element's value and index. After the inner loop over j completes, then, minloc holds the index of the smallest number in position i or beyond. This is the position whose value we swap into position i. Then we can proceed to the next i.
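
As a quick sanity check, we can run the function on the example list from above; note that it rearranges the list in place rather than returning a new one.

numbers = [12, 1, 13, 5, 2]
selection_sort(numbers)
print(numbers)   # [1, 2, 5, 12, 13]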

2. Insertion sort

Insertion sort is an alternative sorting algorithm. For it, we keep the first segment of the list sorted, and we slowly grow this segment, element by element, until it encompasses the entire list.

12 1 13 5 2
Segment starts with first element only.
1 12 13 5 2
Then we grow it to the first two elements.
1 12 13 5 2
Then the first three elements.
1 5 12 13 2
Then the first four elements.
1 2 5 12 13
… until the segment covers the entire list.

Each time we want to expand the segment by an element, the only thing we need to do is to insert the new element into the already sorted list — hence the name insertion sort. The insertion process involves shifting the elements of the old segment that are greater than the new element up by one position, to make room for the new element's proper position.

To implement this in a program, we need a variable to keep track of how large the current segment is; we'll use i for this. Each iteration, our goal is to increment i by one, which requires us to insert element i into the first i elements. To shift elements upward to make room, we'll use another variable, j, which will start at i − 1 and move down until it reaches an element no greater than the one being inserted.

def insertion_sort(data):
    for i in range(1, len(data)):
        t = data[i]                    # the element to insert
        j = i - 1
        while j >= 0 and data[j] > t:  # shift larger elements up one slot
            data[j + 1] = data[j]
            j -= 1
        data[j + 1] = t                # drop t into the gap left behind
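
We can check this function against the running example, too; like selection sort, it sorts the list in place.

numbers = [12, 1, 13, 5, 2]
insertion_sort(numbers)
print(numbers)   # [1, 2, 5, 12, 13]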

3. Comparing efficiency

So we now have two very different approaches to sorting a list. Is one “better” than another in any way? The natural thing to ask is whether one turns out to be faster.

We could imagine simply timing how quickly each runs on a computer, but of course that just gives us a measurement, without much evidence of why one does better or any assurance that we'd get the same result on a different computer. Rather than such experimental evidence, then, for the moment we'll concentrate on a bit of theoretical analysis.

In particular, we'll consider the question: How often does each algorithm go through the process of comparing elements from a list containing n values? The answer to this question turns out to correlate quite well to actual performance.

Looking at selection sort first, we begin with the “first phase” where i is 0, in which we want to find the smallest element. That involves a loop over values of j from 1 to n − 1, where each iteration performs one comparison of elements from the list. So the first phase includes n − 1 comparisons.

In the “second phase,” when i is 1, we loop over values of j from 2 to n − 1, so there will be n − 2 comparisons. We have n − 3 comparisons in the third phase, n − 4 comparisons in the fourth phase, and so on until the last phase where there is just one comparison.

Looking at all phases, then, the total number of comparisons is

(n − 1) + (n − 2) + (n − 3) + … + 2 + 1

Many students will promptly recognize this as the sum of all the integers from 1 up to n − 1 and apply a formula to conclude that the sum is ½ n² − ½ n. If you don't know the formula, though, you could observe for yourself that if you simply add the same series to itself in reverse, you'd find yourself adding n − 1 to 1, adding n − 2 to 2, and so on up to adding 1 to n − 1. Each pair sums to n, and there are n − 1 pairs, so the sequence added to itself is n (n − 1) = n² − n. Just one copy of the sequence, then, must be exactly half of that: ½ n² − ½ n.
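
If you'd like to verify this count experimentally, one possible sketch is an instrumented variant of the code (the name selection_sort_count and the comparisons tally are ours, added just for this check) that counts every element comparison. Whatever the input's initial order, the tally matches ½ n² − ½ n exactly.

import random

def selection_sort_count(data):
    comparisons = 0
    for i in range(len(data) - 1):
        minval = data[i]
        minloc = i
        for j in range(i + 1, len(data)):
            comparisons += 1               # one comparison per inner iteration
            if data[j] < minval:
                minval = data[j]
                minloc = j
        data[minloc] = data[i]
        data[i] = minval
    return comparisons

for n in [5, 10, 100]:
    data = list(range(n))
    random.shuffle(data)
    print(n, selection_sort_count(data), (n * n - n) // 2)   # last two columns agree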


Now, what about insertion sort?

In the first phase, there is just one comparison.

In the second phase, there are either one or two comparisons: It depends on whether the number being inserted falls above the first number to which it is compared. If the list is shuffled randomly, there is a 1/3 chance that it belongs above the first number to which it is compared, so the expected number of comparisons is (1/3) 1 + (2/3) 2 = 5/3 < 2. (You see the less-than sign there at the end? That's because we're going to overestimate the number of comparisons slightly so that the numbers work out more nicely. This is a bit lazy, but weakening our argument for insertion sort's efficiency just a bit doesn't really have much of an effect on our conclusion.)

For the third phase, the expected number of comparisons is (1/4) 1 + (1/4) 2 + (2/4) 3 = 2¼ < 2½. Overall, there are n − 1 phases; the total expected number of comparisons among them is less than 1½ + 2 + 2½ + … + (n + 1)/2, which is the same as ½ (3 + 4 + 5 + … + (n + 1)), which is less than ¼ n² + n.

For large n, selection sort's ½ n² − ½ n is roughly twice insertion sort's ¼ n² + n, so we'd conclude that insertion sort is roughly twice as fast. The savings comes from the fact that in each phase insertion sort “typically” moves the new element only about halfway through the segment, whereas selection sort always scans the entire remaining sub-list.

One difference from selection sort is that insertion sort's speed depends on the initial arrangement of the input. As an extreme example, for lists already in sorted order, there will never be any iterations of insertion sort's inner loop, so insertion sort will take only O(n) time. This is unusual, though: for most orderings, insertion sort takes O(n²) time. Still, since there's a chance that it may go faster, programmers generally prefer insertion sort over selection sort.
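
Both behaviors are easy to observe. Below is one possible experiment using an instrumented variant of the code (insertion_sort_count is our own name, added just for this measurement): averaged over shuffled lists the tally lands near ¼ n², while an already sorted list costs exactly n − 1 comparisons.

import random

def insertion_sort_count(data):
    comparisons = 0
    for i in range(1, len(data)):
        t = data[i]
        j = i - 1
        while j >= 0:
            comparisons += 1           # the data[j] > t test from the original loop
            if data[j] <= t:
                break
            data[j + 1] = data[j]
            j -= 1
        data[j + 1] = t
    return comparisons

n = 100
trials = 1000
total = 0
for _ in range(trials):
    data = list(range(n))
    random.shuffle(data)
    total += insertion_sort_count(data)
print("average on shuffled lists:", total / trials)               # near n²/4 = 2500
print("on a sorted list:", insertion_sort_count(list(range(n))))  # exactly n − 1 = 99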

4. Merge sort

The selection sort and insertion sort algorithms, along with most other intuitive techniques, take O(n²) time. It turns out, though, that a different algorithm called merge sort does better.

The merge sort algorithm is based on the idea of recursion. In particular, given a list, we first recursively sort the first and second halves of the list separately. Then we merge the two sorted halves together to obtain our final result.

1. We start with a list.
5 17 9 12 4 2 10 14
2. Divide the list into two halves.
5 17 9 12
and
4 2 10 14
3. Recursively sort each half.
5 9 12 17
and
2 4 10 14
4. Merge the sorted halves.
2 4 5 9 10 12 14 17

Implementing the merge sort algorithm, as below, is not too complex.

def merge_sort(data):
    if len(data) > 1:
        mid = len(data) // 2           # divide list into two halves
        a = data[:mid]
        b = data[mid:]
        merge_sort(a)                  # sort each half
        merge_sort(b)

        ai = 0                         # merge the sorted halves
        bi = 0
        for k in range(len(data)):     # pull into each slot of result
            if bi >= len(b) or (ai < len(a) and a[ai] < b[bi]):
                data[k] = a[ai]            # pull from a
                ai += 1
            else:
                data[k] = b[bi]            # or pull from b
                bi += 1

Dividing the list into two and recursively sorting both halves is simply a matter of finding where to split the list (index mid), copying each half into its own list, and making the recursive calls. The bulk of the implementation is spent in the merging. Here, we use two indices, ai and bi, tracking how far we've merged elements from the two sorted halves. For each slot of the original list, we copy over the smaller of the two elements waiting at the front of a and b, incrementing ai or bi as appropriate, until the original list has been refilled in sorted order.
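
As a quick check, we can run the implementation on the example list from the figures above; as with the other sorts, the list is rearranged in place.

numbers = [5, 17, 9, 12, 4, 2, 10, 14]
merge_sort(numbers)
print(numbers)   # [2, 4, 5, 9, 10, 12, 14, 17]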