Comp 210 Lab 6: Comparing Sorting Algorithms; Parsing

Table of Contents

  • Comparing Sorting Algorithms
  • Big-Oh
  • Parsing
    Comparing Sorting Algorithms

    We have seen three different sorts of sorts in lecture, labs, and homework:
  • Insertion sort: After figuring out how to add one new element to an already-sorted list, you create a sorted list by repeatedly removing an element from the input and inserting it into the sorted output list (which starts as null). (See code.)
  • Merge sort: Start with each number in its own "pile"; repeatedly merge adjacent piles (doubling the size of each pile at each merge step). Since each pile is kept sorted, merging two of them is not so difficult. (See code.)
  • Quicksort: As seen in lecture, you divide the list into the elements smaller than the car and those larger than the car, sort each of those, and you're done after appending them. (See code.)
    So we have three fundamentally different approaches to sorting a list, all of which seem reasonable. Which one do we actually want to use?

    If your programs only sort small lists a few times, it doesn't matter. However, sorting is an operation that is sometimes done repeatedly, or on very long lists. Only after we've decided that sorting is critical to making our programs run reasonably fast do we worry about which sorting method is fastest.

    Making Test Lists

    In order to compare these different sorting algorithms, let's create some large lists of numbers to try them on.

    Write a function nums-up, where (nums-up 7) returns an ascending list of 7 numbers, say (1 2 3 4 5 6 7).

    Now write a function nums-down which similarly creates a descending list of any length.

    Now write a function nums-rand, which similarly creates a list of random numbers in the range from 0 to 32767. To create a single random number in the range from 0 to 9, you could use (random 10). For larger numbers, try:

    (define max-rand 32768)
    (random max-rand)
    
    (The function random is built-in; the value 32768 happens to be the biggest useful number to pass to that function. But this could conceivably change in the future; using the place-holder max-rand in your code makes it both easier to read and easier to maintain.)

    Note that random is certainly a function in the Scheme sense, but not in the mathematical sense (since the output is not determined by the input).
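
    In case you want to check your versions, here is one possible sketch of all three (just one design among many; the helper name count-up is ours, and max-rand is as defined above):

    (define nums-up
      (lambda (n)
        (local ((define count-up              ;; list the numbers from i up to n
                  (lambda (i)
                    (cond ((> i n) null)
                          (else (cons i (count-up (+ i 1))))))))
          (count-up 1))))

    (define nums-down
      (lambda (n)
        (cond ((= n 0) null)
              (else (cons n (nums-down (- n 1)))))))

    (define nums-rand
      (lambda (n)
        (cond ((= n 0) null)
              (else (cons (random max-rand)
                          (nums-rand (- n 1)))))))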

    (Hey... all three functions you just wrote seem rather similar. How would you write a single function make-list which abstracts the similar parts of the functions, and then use make-list to write all the above three functions as one-liners?)
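
    One possible abstraction (a sketch; the name make-list and its argument order are our choices): have make-list take the length n and a function f saying what element goes at each position.

    ;; (make-list n f) returns (list (f 1) (f 2) ... (f n))
    (define make-list
      (lambda (n f)
        (local ((define build
                  (lambda (i)
                    (cond ((> i n) null)
                          (else (cons (f i) (build (+ i 1))))))))
          (build 1))))

    (define nums-up   (lambda (n) (make-list n (lambda (i) i))))
    (define nums-down (lambda (n) (make-list n (lambda (i) (- (+ n 1) i)))))
    (define nums-rand (lambda (n) (make-list n (lambda (i) (random max-rand)))))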

    Timing

    There is a built-in function time-apply, which can be used to time how long it takes the computer to evaluate something.

    We'd like to be able to say (time-apply (qsort big-list)), where big-list is some previously-defined list. However, time-apply doesn't quite work like this. You must give time-apply a function of zero arguments. That is, something like (time-apply qsort-big-list), where qsort-big-list takes no arguments and predictably sorts big-list. (Remember, big-list must be previously defined.) How can we write this function? Try it! (You may want to have qsort-big-list do the work of sorting, but then always return just #t, since we don't really want to see the sorted list printed out; it will be long.) Here's our solution.
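
    For instance, a sketch along those lines (ours, not necessarily the linked solution; it assumes qsort and nums-rand, as above, are defined):

    (define big-list (nums-rand 1000))

    (define qsort-big-list
      (lambda ()
        (begin (qsort big-list)   ;; do the sorting work ...
               #t)))              ;; ... but return only #t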

    (time-apply reports two numbers, both of which are times in milliseconds. The second number is the actual time elapsed. The first number is the cpu time; that is, how much time DrScheme itself used up, as opposed to time spent on your Netscape program, updating any mouse movements you made, handling any mail requests coming in, etc.)

    Note that the time might vary for repeated evaluations of the same expression. This is because different amounts of time may be spent loading parts of DrScheme from disk or cache memory, figuring out what memory is no longer being used and can be recycled, etc. You should time the same function several times to get a better idea of how much time is typically required.

    Okay, now we're set! Your section leader will divide you up into three different groups (Team iSort, Team mSort, Team qSort); your group is in charge of running one of the sorting algorithms on lists of size 1000, 2000, and 10000, for each of the types of data (ascending, descending, random). For instance, your task might be to write a function my-task that takes no arguments and calls mSort on a list of 1000 ascending numbers, and then find (time-apply my-task).

    How would you make a function that takes a list as an argument (we'll call it nums), and returns a function that takes no arguments and calls mSort on the list nums? Hint: use local. Try it!
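
    A sketch of one way to do this (the names make-msort-task and task are ours; it assumes mSort is defined):

    (define make-msort-task
      (lambda (nums)
        (local ((define task
                  (lambda ()                 ;; a no-argument function that,
                    (begin (mSort nums)      ;; when called, sorts nums ...
                           #t))))            ;; ... and returns #t
          task)))

    Then (time-apply (make-msort-task (nums-up 1000))) times mSort on a list of 1000 ascending numbers.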

    On the board, your section leader will tally all the results. Which algorithm appears to be quickest for random data? Is this also the best sort to use if your data happens to be mostly-sorted?

    (To be fair, quicksort does somewhat better if you have data stored in a different manner (bounded lists), which is often true in many scientific applications...not to mention Microsoft's accounting software, as alluded to in class.)

    Big-Oh

    Can we get a more systematic idea of how fast these different sorting algorithms run?

    Thinking about insertion sort

    Consider insertion sort: To sort a list of n items, it does the following:
  • insert an item into the empty list,
  • insert an item into a list of length 1,
  • insert an item into a list of length 2,
  • ...
  • insert an item into a list of length n-1.

    Approximately how many steps, at most, could it take to insert an item into a list of length k?

    As a crude overestimate, consider insertion sort as doing n repetitions of the task "insert an item into a list of length at most n". So roughly, you could say that insertion sort, given a list of n items, takes about n^2 steps.
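
    (A slightly more careful count, for the curious: inserting an item into a list of length k takes at most k+1 steps, so the total over the whole sort is at most

        1 + 2 + ... + n  =  n(n+1)/2

    steps, which is still about n^2, up to a constant factor.)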

    Of course you don't know how long it takes the computer to do exactly one 'step' (a term we are using fuzzily), but do the timings of insertion sort suggest that, in the worst case, it does indeed take on the order of n^2 time?

    Thinking about merge sort

    Now consider merge sort. To sort a list of n items, it makes n stacks of size 1, and then:
  • merge adjacent pairs of size-1 stacks into size-2 stacks;
  • merge adjacent pairs of size-2 stacks into size-4 stacks;
  • merge adjacent pairs of size-4 stacks into size-8 stacks;
  • ...
  • merge adjacent pairs of size-n/2 stacks into a size-n stack.

    How many steps is this, in terms of n?

    Within each merge step, observe that every stack is riffled through once, which means exactly n items are looked at. Thus, we have lg(n) merge steps, each of which takes time proportional to n. (lg means logarithm base-2; read here for justification of why the particular base of the logarithm doesn't really matter for our purposes.)
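
    (Why lg(n) merge steps? The stack size doubles with each step, so after p steps the stacks have size 2^p; we stop when 2^p reaches n, that is, after p = lg(n) steps. The total work is then

        n + n + ... + n   [lg(n) times]   =   n * lg(n)

    steps, up to a constant factor.)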

    So roughly, merge-sort takes on the order of n log(n) steps, up to some constant factor. Do the timings support this estimate?

    Formalizing "order-of"

    Computer Science has a formal notion of the above hand-wavy "order-of" arguments. We say "a function f(n) is order of n^2" to mean "f(n) < n^2", along with some fine print: "up to some constant factor, and perhaps ignoring small values of n".

    "Function f (n) is order of n2" is abbreviated by saying "f (n) is in big-Oh of n2", which is written "f (n) is in O (n2)".

    Of course, there is nothing special about n^2; this generalizes to any function.

    Enough hand-waving. What do we mean by "up to some constant factor" and "ignoring small values of n"? Don't worry, there is a technical definition:

    Definition: A function f(n) is in O(g(n)) if:
        there exist numbers c and bigEnough such that
        for all n >= bigEnough,
        it is true that f(n) <= c * g(n).

    (A technical note on abusage of this definition).

    Finally, we can say that merge-sort runs in time O(n log(n)), and insertion sort in time O(n^2). This refers to how long these algorithms take in the worst case. (Though these are only upper bounds, we usually try to make the strongest statement we can. Thus saying merge-sort runs in time O(n^3) is true, but it's not a very strong statement.)

    Can you give an upper bound for the running time of quicksort? Support your answer.

    Parsing

    In lecture, you saw an example of parsing input: a graph that was given to you as one list of symbols. The process of taking a stream of words and internalizing them into a meaningful format is called parsing.

    Parsing is a common task. Consider a calculator: it gets a stream of key-presses

    7 'plus 23 'times 'openParen 2 'minus 1 'closeParen
    
    and must represent them internally, for instance to group the expression inside parentheses together.

    Actually, the calculator must do even more: it must parse the keypresses 2 3 'dot 4 just to get the single number 23.4.

    Your task: write a function that takes a list of key presses and returns the same list, but with sequences of digits turned into individual numbers. (Ignore any decimal points for now.) For example, (parse-nums (list 2 3 'times 0 'minus 1 0 0)) should return (list 23 'times 0 'minus 100).

    Proceed in a top-down fashion: we start writing our function, and when we realize we'll need a helper function, we'll just assume it for now, and write it later.

    ;; (parse-nums in)
    ;; given a list of input keystrokes "in",
    ;; return the same list but with consecutive
    ;; digits turned into numbers.
    ;;
    ;; Example:
    ;; (parse-nums (list 2 3 'times 0 'minus 1 0 0))
    ;; =
    ;; (list 23 'times 0 'minus 100)
    ;;
    (define parse-nums
      (lambda (in)
        (cond ((null? in) null)
              ((nonDigit? (car in))
               (cons (car in)                  ;; just pass the car through
                     (parse-nums (cdr in))))
              (else                            ;; we have a digit: grab it!
               (cons (get-one-num in)
                     (parse-nums (remove-one-num in)))))))


    ;; (digit? s)
    ;; return #t iff s is a digit: one of 0, ..., 9
    ;;
    (define digit?
      (lambda (s)
        (and (number? s)
             (>= s  0)
             (<  s 10)
             (exact? s))))

    (define nonDigit?
      (lambda (s)
        (not (digit? s))))


    Exercise: write get-one-num and remove-one-num. Will either of these need an accumulator? Test your functions on several inputs. Only then try applying parse-nums to your test input.
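
    Here is one possible sketch (not the only design; the helper name grab is ours). get-one-num uses an accumulator to build the number up digit by digit:

    ;; (get-one-num in) returns the number formed by the leading
    ;; digits of in, e.g. (get-one-num (list 2 3 'times 0)) = 23
    (define get-one-num
      (lambda (in)
        (local ((define grab                     ;; acc holds the number
                  (lambda (in acc)               ;; built up so far
                    (cond ((null? in) acc)
                          ((digit? (car in))
                           (grab (cdr in) (+ (* 10 acc) (car in))))
                          (else acc)))))
          (grab in 0))))

    ;; (remove-one-num in) returns in without its leading digits,
    ;; e.g. (remove-one-num (list 2 3 'times 0)) = (list 'times 0)
    (define remove-one-num
      (lambda (in)
        (cond ((null? in) null)
              ((digit? (car in)) (remove-one-num (cdr in)))
              (else in))))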

    If you finish with that, modify your code to parse numbers with a decimal point: (list 2 3 'dot 4 'times 0 'dot 0 'minus 1 0 0) should return (list 23.4 'times 0.0 'minus 100), or something equivalent.
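
    A hint of a sketch for the fractional part (the name digits->fraction is ours): a helper that turns a list of digits into the fraction they denote, working right to left, which you can combine with the whole-number part from get-one-num.

    ;; (digits->fraction (list 4))   = 4/10
    ;; (digits->fraction (list 2 5)) = 1/4, i.e. 0.25
    (define digits->fraction
      (lambda (digits)
        (cond ((null? digits) 0)
              (else (/ (+ (car digits)
                          (digits->fraction (cdr digits)))
                       10)))))

    For example, (exact->inexact (+ 23 (digits->fraction (list 4)))) is 23.4.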

    Entirely Optional: parsing calculator expressions.


    Back to Comp 210 Home