Comp 210 Lab 10: Timing Sorting Algorithms

In class, we've seen several different ways of sorting numbers. Today's lab involves measuring and comparing the performance of these sorts on different inputs, and explaining the data. We'll focus on three: insertion sort (``isort''), mergesort (``msort''), and quicksort (``qsort'').

So we have three fundamentally different approaches to sorting a list. It would be unsurprising if they behaved differently from one another. Can we observe this difference? Can we explain it? Is one of them the best way? (What exactly might "best" mean?)

If your programs only sort small lists a few times, it doesn't matter; any sort that is easy to write and get correct is fine. However, for longer lists, the difference really is huge. In Real Life (tm), sorting is an operation often done on very long lists (repeatedly). Only after we've decided that sorting is critical to having our programs run reasonably fast do we worry about which sorting method is fastest.

Making Test Lists

To do: In DrScheme, open the file /home/comp210/Labs/Lab10/sorts.ss, and then Save As... into your own lab directory (so you can modify it).

This file contains several sorting functions.

;; iSort: list-of-numbers --> list-of-numbers
;; mSort: list-of-numbers --> list-of-numbers
;; qSort: list-of-numbers --> list-of-numbers
;; sSort2: list-of-numbers --> list-of-numbers
;; sSort1: list-of-numbers --> list-of-numbers
These are insertion sort, mergesort, quicksort, selection sort (two-pass), and selection sort (one-pass). The file also contains several easy functions to build a list, which we've seen before (and which you could write yourself in about 60 seconds each, if needed):
;; nums-down: natnum --> list-of-num
;; nums-up:   natnum --> list-of-num
;; nums-rand: natnum --> list-of-num
;;
;; Return a list of n descending, ascending, or random numbers.
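In case you want a reminder of how such list builders look, here is a minimal sketch; the exact ranges used, and the use of random in nums-rand, are assumptions rather than the precise code in sorts.ss:

;; nums-down: natnum --> list-of-num
;; Return the list (n n-1 ... 2 1).
(define (nums-down n)
  (cond ((zero? n) empty)
        (else (cons n (nums-down (sub1 n))))))

;; nums-up: natnum --> list-of-num
;; Return the list (1 2 ... n).
(define (nums-up n)
  (reverse (nums-down n)))

;; nums-rand: natnum --> list-of-num
;; Return a list of n random numbers, each in [0,32767].
(define (nums-rand n)
  (cond ((zero? n) empty)
        (else (cons (random 32768) (nums-rand (sub1 n))))))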
Finally, there is also a built-in function time:
(time (iSort (nums-rand 20)))
   cpu time: 30 real time: 33 gc time: 0
(list ...)

(The times are reported in milliseconds: "cpu time" is the processor time spent in your program, "real time" is elapsed wall-clock time, and "gc time" is time spent on garbage collection.)

To do

  1. Make sure you can run a couple of the sort functions mentioned above, and time them as well: (time (iSort (nums-rand 20))). If we try sorting a list of 200 or 2000 instead, it's kind of annoying that after the timing information is printed, DrScheme also prints the list of hundreds of numbers.
    Helpful hint: wrap your timing call in some function like list? -- the answer will always be true, but we didn't really care about seeing the sorted list anyway. (Should this list? be inside the time or outside? What else might be done outside the time?) A concrete timing sketch appears after this list.
  2. Record how long it takes to call iSort on a list of (say) 1000 random numbers. Note these times for the next step.
  3. Look at the code for iSort (and its helper insert); it is the initial, naive attempt at sorting, just blindly following the template (a similar version is sketched after this list). Can we tweak the code at all to get a performance improvement?

    You have two minutes: make as many of these tweaks as you like, and then run your modified iSort on 1000 random numbers. What percentage speedup results? Is your code as easy to understand? (What if somebody else is assigned to update your code in a couple of weeks?)

    Discuss: How big a gain is this, in a large program which spends 0.1% of its time sorting? (E.g. Southwest Bell, which sorts its list of phone numbers, but only updates new entries once a day?) How big a gain is this, in a large program which spends 20% of its time sorting? (E.g. MarioSuperHappyWorld, which spends a lot of time sorting the list of all surfaces and objects near Mario, but also spends a lot of its time drawing the objects and calculating physics and turning turtles belly-up.)

    1st stricture against pre-emptive optimization: Don't optimize until you know you need to.
    Studies show that expert programmers with years of experience are not any good at guessing where a program's bottlenecks are. In the real world, write your program for clarity and maintainability first; this saves hundreds of person-hours during the development and debugging phases. Only once you determine your program is too slow do you go back and profile it, find the actual bottlenecks, and hand-tweak those.

    I myself am repeatedly surprised by how often the naive, straightforward-but-dumb approach is plenty fast enough, and my program doesn't run noticeably slowly.

    (In the full PLT language levels, you can set your language options to track profiling information, and choose "show profile" from a menu!)

  4. Time qSort on a list of 1000 random numbers. What perspective does this give you on the gains of tweaking iSort?
    2nd stricture against pre-emptive optimization: Consider significantly different approaches, before tweaking.
  5. In class, we discussed selection sort: Select the smallest item in a list, recursively sort all-but-the-smallest, and then cons that smallest item on the front.

    This is easy if you take a 2-pass approach (see sSort2, and the sketch after this list): make one call to find the minimum, and another separate call to remove it from the list. When writing this, one realizes that rather than making two passes over the list, it can be done in a single pass (although the code is significantly more complicated, as seen in class and in sSort1).

    How much do we gain with our extra-fancy one-pass approach? Compare the timings of sSort1 vs. sSort2. I was surprised at the result! But it drives home:

    3rd stricture against pre-emptive optimization: More complicated fancy code can easily be slower code!
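To illustrate the hint in step 1, here is one way to time a sort without drowning in output (just a sketch -- use whatever list size and sort you want to measure):

;; Build the test list *outside* the call to time, so that the cost of
;; nums-rand isn't charged to the sort itself:
(define test-list (nums-rand 2000))

;; list? of the result is simply true, so DrScheme prints "true"
;; instead of the 2000 sorted numbers:
(time (list? (iSort test-list)))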
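For reference in step 3, an insertion sort that blindly follows the list template looks roughly like this (a sketch from memory -- the actual iSort and insert in sorts.ss may differ in details):

;; isort-sketch: list-of-numbers --> list-of-numbers
;; Sort the rest of the list, then insert the first item into place.
(define (isort-sketch alon)
  (cond ((empty? alon) empty)
        (else (insert-sketch (first alon) (isort-sketch (rest alon))))))

;; insert-sketch: number list-of-numbers --> list-of-numbers
;; Insert n into its proper place in the already-sorted list alon.
(define (insert-sketch n alon)
  (cond ((empty? alon) (cons n empty))
        ((<= n (first alon)) (cons n alon))
        (else (cons (first alon) (insert-sketch n (rest alon))))))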
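And for step 5, the 2-pass selection sort described above might be sketched as follows; the helper names my-min and remove-first are made up for this sketch, and the provided sSort2 may be organized differently:

;; ssort2-sketch: list-of-numbers --> list-of-numbers
;; Find the smallest item (pass 1), remove it (pass 2), sort the rest,
;; and cons the smallest item onto the front.
(define (ssort2-sketch alon)
  (cond ((empty? alon) empty)
        (else (cons (my-min alon)
                    (ssort2-sketch (remove-first (my-min alon) alon))))))

;; my-min: non-empty list-of-numbers --> number
;; Return the smallest number in alon.
(define (my-min alon)
  (cond ((empty? (rest alon)) (first alon))
        (else (min (first alon) (my-min (rest alon))))))

;; remove-first: number list-of-numbers --> list-of-numbers
;; Remove the first occurrence of n from alon.
(define (remove-first n alon)
  (cond ((empty? alon) empty)
        ((= n (first alon)) (rest alon))
        (else (cons (first alon) (remove-first n (rest alon))))))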

Comparing Sorts

To do: Okay, now we're set! Divide your lab group into three teams (Team iSort, Team mSort, Team qSort); each team is in charge of running one of the sorting algorithms on lists of size 500, 1000, and 2000, for each of the types of data (ascending, descending, random).

Your labby will draw a table on the whiteboard; as you finish your measurements, record your results there.

What real-world issues suddenly are of concern, when gathering this data?

                               input size
                        500        1000        2000
Insertion sort   up
                 down
                 rand
Mergesort        up
                 down
                 rand
Quicksort        up
                 down
                 rand

(To be fair, quicksort does somewhat better if your data is stored not in a list but in a data structure where you can access any element directly (a vector), as is common in scientific applications.)


Analyzing the Results

To do: Discuss: Which algorithm is best? What does "best" mean? Which algorithm might be best when your data happens to be mostly sorted? (Consider a bank, which has a list of your checks already sorted by transaction date and then wants to sort them by check number. Or consider sorting a pile of graded exams by score -- the scores might be correlated (or anti-correlated) with the order in which they were turned in!)

To do: Explain the results, thinking about the code (doing an imaginary hand-evaluation in your head).


Big-Oh (optional)

Can we get a more systematic idea of how fast these different sorting algorithms run?

So far, we've tried to get a gut feeling for the behavior of the algorithms. However, there is "science" in computer science; it is important to be able to formalize these notions without any hand-wavy arguments. This is done with the ``order-of'' concept, which has a formal description.

We say a "function f(n) is order of n²" to mean ``f(n) < n²'' along with some fine print: ``up to some constant factor, and perhaps ignoring small values of n''.

"Function f(n) is order of n²" is abbreviated by saying "f(n) is in big-Oh of n²", which is written "f(n) is in O(n²)".

Of course, there is nothing special about n²; this generalizes to any function.

The Nitty-gritty

Enough hand-waving. What do we mean by "up to some constant factor" and "ignoring small values of n"? Don't worry, there is a technical definition:

Definition: A function f(n) is in O(g(n)) if there exist numbers c, bigEnough such that for all n >= bigEnough, f(n) <= c * g(n).
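For example, f(n) = 3n² + 10n is in O(n²): take c = 4 and bigEnough = 10; then for all n >= 10 we have 10n <= n², and therefore 3n² + 10n <= 4 * n².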

(A technical note on abusage of this definition).

Finally, we can say that mergesort runs in time O(n log n), and insertion sort and quicksort in time O(n²). This refers to how long these algorithms take in the worst case. (more on big-Oh notation) (One might also want to talk about the average case, but it can sometimes be difficult even to define what an average list is!)

(We can also talk about running times in the "average" case, i.e., as an average over all possible inputs. Quicksort runs in time O(n log n) in the average case. Quicksort's worst-case time can be improved to this bound as well if we use vectors and a better way of picking the pivot.)