COMP 210 Lab 9: Timing Sorting Algorithms

Because of the exam and midterm break, we haven't covered any new material since last lab. So, this week's lab goes on a tangent and looks at some efficiency issues of algorithms we've seen.


In class, we've seen several different ways of sorting numbers. Now, you will measure and compare the performance of these sorts on different inputs and explain the data. We'll focus on three algorithms: insertion sort, mergesort, and quicksort.

With three fundamentally different approaches to sorting a list, it seems unsurprising that they may behave differently. Can we observe this difference? Can we provide explanations? Is one of them the best way? (What exactly might "best" mean?)

If your programs only sort small lists a few times, it doesn't matter; any sort that is easy to write and get correct is fine. However, for longer lists, the difference really is huge. In Real Life (tm), sorting is an operation often done (repeatedly) on very long lists. Only after we've decided that sorting is critical to making our programs run reasonably fast do we worry about which sorting method is fastest.

Using lab code

In DrScheme, open the file /home/comp210/Labs/Lab09/sorts.ss, and then Save-as... the file in your own lab directory (so you can modify it).

This file contains several sorting functions.

     ;; iSort: list-of-numbers --> list-of-numbers
     ;; mSort: list-of-numbers --> list-of-numbers
     ;; qSort: list-of-numbers --> list-of-numbers
     ;; sSort2: list-of-numbers --> list-of-numbers
     ;; sSort1: list-of-numbers --> list-of-numbers
These are insertion sort, mergesort, quicksort, selection sort (two-pass), and selection sort (one-pass). It also contains several easy functions to build a list, which we've seen before:
     ;; nums-down: natnum --> list-of-num
     ;; nums-up:   natnum --> list-of-num
     ;; nums-rand: natnum --> list-of-num
     ;;
     ;; Return a list of n descending, ascending, or random numbers.
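
For instance, based on those purpose statements, the builders should behave roughly like this (the particular lists, especially the random one, are just illustrative guesses):

     (nums-down 5)   ;; e.g., (list 5 4 3 2 1)  -- descending
     (nums-up 5)     ;; e.g., (list 1 2 3 4 5)  -- ascending
     (nums-rand 5)   ;; e.g., (list 78 3 41 3 90) -- five random numbers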
Finally, there is also a built-in function time, e.g.,
     (time (iSort (nums-rand 20)))

     cpu time: 30 real time: 33 gc time: 0
     (list ...)
where the times are in milliseconds: cpu time is the processor time the computation used, real time is the elapsed wall-clock time, and gc time is the portion of the cpu time spent on garbage collection.

Some initial timings and discussion
  1. Make sure you can run a couple of the sort functions mentioned above, and time them as well, e.g.,

             (time (iSort (nums-rand 20)))
             

    If we sort a really big list, it's annoying that the information we want (the times) comes along with a flood of information we don't want (the entire sorted list). To reduce the amount of output we don't care about, use something like the following:

             (list? (time (iSort (nums-rand 20))))
             
    which always returns true.

  2. Record how long it takes to call iSort on a list of 1000 random numbers. Note these times for the next step.

  3. Discuss: Look at the code for iSort (and its helper insert) -- the initial, naive attempt at sorting, which just blindly follows the template. (A sketch of what this code might look like appears after this list.) Can we tweak the code at all to get a performance improvement?

  4. In 2 minutes, make as many of these tweaks as you like, and then run your modified iSort on 1000 random numbers.

  5. Discuss: What percentage speedup results? How big a gain is this, in a large program which spends 0.1% of its time sorting? (E.g., Southwest Bell, which sorts its list of phone numbers, but only updates new entries once a day?) How big a gain is this, in a large program which spends 20% of its time sorting? (E.g., MarioSuperHappyWorld, which spends a lot of time sorting the list of all surfaces and objects near Mario, but also spends a lot of its time drawing the objects and calculating physics and turning turtles belly-up.)

    Is your code as easy to understand? (What if somebody else is assigned to update your code in a couple weeks?)

    1st stricture against pre-emptive optimization: Don't optimize until you know you need to.
    Studies show that expert programmers with years of experience are not very good at guessing where a program's bottlenecks are. In the real world, write your program for clarity and maintainability first; this saves hundreds of person-hours during the development and debugging phases. Only once you determine your program is too slow, do you then go back and profile your program, finding the actual bottlenecks, and hand-tweaking those.

    Myself, I am repeatedly surprised by how often the naive, straightforward-but-dumb approach is plenty fast enough, and my program doesn't run noticeably slowly.

    (In the full PLT language levels, you can set your language options to track profiling information, and choose "show profile" from a menu!)

  6. Time qSort on a list of 1000 random numbers.

    What perspective does this give you, on the gains of tweaking iSort?

    2nd stricture against pre-emptive optimization: Consider significantly different approaches, before tweaking.
  7. Another approach is selection sort: Select the smallest item in a list, recursively sort all-but-the-smallest, and then cons that smallest item on the front.

    This is easy, if you take a 2-pass approach (see sSort2): make one call to find the minimum, and another separate call to remove it from the list. When writing this, one realizes that rather than making two passes over the list, the whole thing can be done in a single pass, although the code is significantly more complicated: sSort1. (Sketches of both versions appear after this list.)

    How much do we gain with our extra-fancy one-pass approach? Compare timings of sSort1 vs sSort2.

    How do they compare? I was surprised at the result! But it drives home:

    3rd stricture against pre-emptive optimization: More complicated fancy code can easily be slower code!
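
For reference, here is a minimal sketch of what a naive, template-following insertion sort might look like; the actual iSort and insert in sorts.ss may differ in their details:

     ;; insert: number sorted-list-of-numbers --> sorted-list-of-numbers
     ;; slide n into its proper place in the already-sorted list alon
     (define (insert n alon)
       (cond [(empty? alon) (list n)]
             [(<= n (first alon)) (cons n alon)]
             [else (cons (first alon) (insert n (rest alon)))]))

     ;; iSort: list-of-numbers --> list-of-numbers
     ;; sort by inserting the first item into the sorted rest
     (define (iSort alon)
       (cond [(empty? alon) empty]
             [else (insert (first alon) (iSort (rest alon)))]))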
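
Similarly, here are sketches of the two selection sorts from step 7 -- the two-pass sSort2 and the one-pass sSort1. Again, these are only illustrations of the idea; the versions in sorts.ss may be written differently:

     ;; find-min: non-empty list-of-numbers --> number
     (define (find-min alon)
       (cond [(empty? (rest alon)) (first alon)]
             [else (min (first alon) (find-min (rest alon)))]))

     ;; remove-once: number list-of-numbers --> list-of-numbers
     ;; remove the first occurrence of m (m is known to be in alon)
     (define (remove-once m alon)
       (cond [(= m (first alon)) (rest alon)]
             [else (cons (first alon) (remove-once m (rest alon)))]))

     ;; sSort2: list-of-numbers --> list-of-numbers
     ;; two passes: one call to find the minimum, a second to remove it
     (define (sSort2 alon)
       (cond [(empty? alon) empty]
             [else (local [(define m (find-min alon))]
                     (cons m (sSort2 (remove-once m alon))))]))

     ;; sSort1: list-of-numbers --> list-of-numbers
     ;; one pass: the helper walks the list once, carrying the smallest
     ;; item seen so far and the list of all the other items seen so far
     (define (sSort1 alon)
       (cond [(empty? alon) empty]
             [else (select (rest alon) (first alon) empty)]))

     ;; select: list-of-numbers number list-of-numbers --> list-of-numbers
     ;; smallest is the least item seen so far; others are the rest seen so far
     (define (select alon smallest others)
       (cond [(empty? alon) (cons smallest (sSort1 others))]
             [(< (first alon) smallest)
              (select (rest alon) (first alon) (cons smallest others))]
             [else (select (rest alon) smallest (cons (first alon) others))]))

(Note that select hands back the not-yet-sorted items in a scrambled order, which is fine here, since they get sorted anyway.)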

More timings
  1. Okay, now we're set! Divide your lab group into three teams (Team iSort, Team mSort, Team qSort); your team is in charge of running one of the sorting algorithms on lists of size 500, 1000, and 2000, for each of the types of data (ascending, descending, random). (A small timing-harness sketch appears after this list, if you want to automate the measurements.)

    Your labby will write a table on the whiteboard; as you finish the measurements, write your results on the whiteboard.

    Timing results
                                  input size
                              500      1000     2000
    Insertion sort   Up       ____     ____     ____
                     Down     ____     ____     ____
                     Random   ____     ____     ____
    Mergesort        Up       ____     ____     ____
                     Down     ____     ____     ____
                     Random   ____     ____     ____
    Quicksort        Up       ____     ____     ____
                     Down     ____     ____     ____
                     Random   ____     ____     ____
  2. Discuss: What real-world issues suddenly are of concern, when gathering this data?

  3. Which algorithm is the best? What does "best" mean? Which algorithm might be best when your data happens to be mostly-sorted? (Consider a bank, which has a list of your checks already sorted by transaction date, and then wants to sort them by check-number. Or consider sorting a pile of graded exams by score -- the scores might be correlated (or, anti-correlated) with the order in which they were turned in!)

  4. Explain the results. Thinking about the code (doing imaginary hand-evaluation in your head):

    • Do you understand why iSort does well on certain kinds of lists?
    • Considering random vs descending lists -- can you find a quantifiable relation between these results? Does thinking about the algorithm help explain this?
    • When you double the size of the input, what seems to happen to the running time? (You may want to verify your hypothesis by running lists of size 4000 or 250.) What is the running-time, as a function of n (up to a constant factor)? See an analysis of insertion sort.
    • Similarly, discuss the performance of, try to explain, and try to quantify the running-time of mergesort and quicksort.
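
As promised in step 1, here is a tiny helper you might write to automate gathering the table entries. time-sort is not part of sorts.ss; it is just a convenience sketch, and (like the earlier examples) it also times the construction of the input list:

     ;; time-sort: (list-of-numbers -> list-of-numbers)
     ;;            (natnum -> list-of-num) natnum --> boolean
     ;; build a list of size n with make-input, sort it with sort-fn, and
     ;; print the timings; the list? wrapper suppresses the sorted list
     (define (time-sort sort-fn make-input n)
       (list? (time (sort-fn (make-input n)))))

     ;; e.g., one row's worth of measurements, for mergesort on random input:
     (time-sort mSort nums-rand 500)
     (time-sort mSort nums-rand 1000)
     (time-sort mSort nums-rand 2000)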

Note that the algorithms, particularly quicksort, would have been faster if the data were stored not in a list, but in a data structure where you can directly access any element (i.e., vectors). This is true of many algorithms.
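
As a small illustration of the difference (this is an aside, not part of the lab code): with a vector you can jump straight to any element, while with a list you must walk down the cons cells to reach it.

     (define lst (nums-rand 2000))
     (define v (list->vector lst))
     (vector-ref v 1500)   ;; constant time -- jumps right to slot 1500
     (list-ref lst 1500)   ;; must first walk past 1500 cons cells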


Big-Oh (optional)

Can we get a more systematic idea of how fast these different sorting algorithms run?

So far, we've tried to get a gut feeling for the behavior of algorithms. However, there is "science" in computer science; it is important to be able to formalize these notions, without any hand-wavy arguments. This is done with the "order-of" concept, which has a formal description.

E.g., we say "function f(n) is order of n²" to mean "f(n) < n²" along with some fine print: "up to some constant factor, and perhaps ignoring small values of n".

"Function f(n) is order of n2" is abbreviated by saying "f(n) is in big-Oh of n2", which is written "f(n) is in O(n2)".

Of course, there is nothing special about n²; this generalizes to any function.

The Nitty-gritty

Enough hand-waving. What do we mean "up to some constant factor", and "small values of n"? Don't worry, there is a technical definition:

Definition: A function f(n) is in O(g(n)) if there exist numbers c and bigEnough such that for all n ≥ bigEnough, f(n) ≤ c×g(n).
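
For example, f(n) = 3n² + 10n is in O(n²): take c = 4 and bigEnough = 10. For every n ≥ 10 we have 10n ≤ n×n, so f(n) = 3n² + 10n ≤ 4n² = c×n².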

(A technical note on abusage of this definition).

Finally, we can say that mergesort runs in time O(n log n), and insertion sort and quicksort in time O(n²). This refers to how long these algorithms take in the worst case. (more on big-Oh notation)

(We can also talk about running times in the "average" case, i.e., as an average over all possible inputs. Quicksort runs in time O(n log n) average case. Quicksort's worst case time can be improved to this bound as well if we use vectors and a better way of picking the pivot.)

Later courses, particularly COMP 212, 280, 314, and 482 analyze algorithms in more detail.