Comp 210 Lab 11: Strings in C

To prepare for this lab, create a subdirectory, and copy files into it as shown:
cd ~/comp210
mkdir lab11
cd lab11
cp ~comp210/Labs/lab11/*.c  .

Table of Contents

  • Representing Strings
  • strLen
  • The Truth About *

    Representing Strings

    The term "string" refers to a bunch of characters strung together: e.g., "hello", "?" and "Caught you red-handed!". C handles strings as vectors of characters, which isn't entirely crazy. For example, if msg is a string variable whose contents are "hello", then msg[0] == 'h', and msg[4] == 'o'. (Note the use of single quotes for characters, and double quotes for strings.)

    So immediately, you realize that if I declare msg as a vector of (say) twenty characters, this imposes a limit on how large a string msg can hold. However, you wouldn't expect to have any trouble storing short strings in the vector:

    char msg[20];
    ...
    msg[0] = 'h';
    msg[1] = 'i';
    msg[2] = '!';
    
    But this raises a question: How do the women inside the computer know that locations zero through two contain "interesting" information, and the rest of the 20 locations are currently junk? In C, the convention is as follows: The end of the interesting data of a string is marked by a special "null" character. The null character is '\0'. (Despite how '\0' looks on the screen, it's only one character after your program gets compiled. Compare it to the newline character, which is written '\n'.)

    That's all there is to strings: null-terminated vectors of characters. We'll spend the next half hour looking at how to use them.

    So what do we need to do so that msg has the value "hi!"?

    char msg[20];
    ...
    msg[0] = 'h';
    msg[1] = 'i';
    msg[2] = '!';
    msg[3] = '\0';
    
    How many locations are needed to store this three-character string? What is the longest string that can be stored in msg, as declared above?

    Write a short program which declares a string variable, gives it a value, and prints it out. To print strings, use printf. But since we're not printing a decimal number, don't use %d; instead use %s for strings. For example, printf( "She said %s, she did.\n", msg );. Now add the instruction msg[3] = '?', and see what happens. This shows how printf follows the protocol of presuming that '\0' marks the end of the interesting data.

    Recall how we can either declare vectors of an explicit size (like 20, above), or we could implicitly give them a size by specifying initial values. Naturally this is also true for vectors of characters.

    char msg[]     = { 'h', 'i', '!', '\0' };    // vector of size 4
    char sameMsg[] = "hi!";                      // shorthand for previous
    
    But strings are common enough that C recognizes things in double-quotes as strings, and automatically provides the terminating null character for you. So sameMsg[3] == '\0', and sameMsg[4] is a bad memory reference.

    Writing strLen()

    In previous weeks we've already written functions which operate on vectors. In that spirit, let's write a function strLen which takes a vector s of characters, and returns how long s is, interpreted as a null-terminated string. For instance, strLen( "hello" ) returns five. Note the capital letter in the name; this is to avoid conflicting with the built-in function strlen, which (curiously enough) happens to do the exact same thing as strLen.
    /*  This file is ~comp210/Labs/lab11/strLen.c
     */
    
    #include <stdio.h>
    
    
    int strLen( char s[] ) {
      int i = 0;
      while (s[i] != '\0') {
        i = i + 1;
        }
    
      // If location i is the terminating null,
      // then locations 0..i-1 comprise i (interesting) characters in s.
      //
      return i;
      }
    
    
    int main() {
      char s1[] = "  Going ... going ... gone!";
      printf( "There are %d letters in <<%s>>.\n", strLen(s1), s1 );
    
      char *s2;  // Think "char s2[];"    What does this cause the compiler to do?
    
      s2 = s1;
      s2[2] = 'B';
      s2[7] = '\0';
      printf( "There are %d letters in <<%s>>.\n", strLen(s1), s1 );
    
      return 0;
      }
    

    Now look closely at the main, and compare it to the program's output. Observe that s2 = s1 results in intensional equality: changing s2 also changes s1, because they are one and the same string. Hmm, this is just how structures behaved. Since vectors were introduced as "structures with uniform access", this result may not have surprised you. Arrays, like structures, are evaluated as "hat" variables. We'll explain this more momentarily.

    What if we don't want intensional equality, but extensional? That is, I want s2 to get its own copy of s1, instead of s1 and s2 becoming different names for the same string.

    If we change the above declaration of s2 to char s2[50]; instead, what does the computer do?

    Exercise: Copy the file ~comp210/Labs/lab11/strCpy.c, and complete it by writing the function strCpy which takes two arguments char dst[], char src[] and copies the contents of src into dst.
    Note how the order of the arguments parallels the assignment operator. (Again, be sure to name the function using an uppercase letter, so as to avoid conflict with the built-in function strcpy.)
    What happens when you uncomment the line that copies a large string into s1?

    The Truth About *

    The line
      char s2[] = "Hop hop hop";
    
    really has two parts: First, it creates a placeholder s2, of type vector-of-character (or in C terminology, "array of character"). Second, it creates a vector of 12 characters, and associates that particular vector as the value of s2.

    But how does the computer represent a vector of characters? As alluded to in lecture, the characters are stored in adjacent buckets in memory. So "Hop hop hop" might mean that memory location 3000 contains 'H', location 3001 contains 'o', ..., location 3010 contains 'p', and location 3011 contains the null character. Sensible enough.

    Now, if you were in charge of keeping track of s2's value, would you keep all twelve of these addresses in mind? An easier way would be to remember just the location of the beginning of s2, namely 3000. In fact, this is what C really does! The variable s2 contains not an entire array, but really just the address 3000. When the program then talks about s2[0], this is translated to "look up the value of s2 (3000), go to that location in memory, and pull out what's there." Similarly, s2[4] translates to "look up the value of s2, that's the location where the vector starts; add 4 and go to that location (3004) in memory, and pull out what's there." (This is the truth, but there's also the whole truth.)

    It's this shortcut of representing vectors as the address where they start in memory that causes vectors to be treated as "hat variables". In fact, the effect of these hat variables can always be achieved by having them not refer directly to the data, but instead to the address where the data is held. Thus we consider "address of data" to itself be a valid type of data. ("Hat variable placeholders themselves count as values.")

    Note that we don't have to treat structures and arrays as hat values (pointers); however, C and Scheme chose to do so because it makes it easier to pass them as arguments, and because intensional identity is often what we want in programs, as we've seen in the move-particle program for instance.

    So deep down, the computer says that s2 is a reference to the array. (Sometimes the word pointer is used.) This now explains why s2 = s1 resulted in intensional equality. The lab leader will hand evaluate that example, following the rules just revealed.

    It's worth noting that it's now more apparent why C lets s2[9999] or s2[-3] work: it just naively goes to the position in memory where the array s2 starts, offsets by 9999 or -3, and returns whatever was in the indicated memory bucket. C lets you (try to) access any memory location like this; in this sense it's like the Jam2000 assembly language.

    Moreover, C has this way of using arrays so ingrained, that you could declare an array of (say) characters by instead declaring it as "a pointer to a memory location holding a character":

      char* s2;          // Think of this as "char s2[];"
      s2 = "Hop hop hop";
    
    (Even worse: C does not allow declaring an array of an unspecified size. The commented-out version is illegal, so you sometimes have to use a * to declare an array.)

    Of course, there is nothing special about vectors of characters -- this is true of any C vector. Moreover, everything said here pertains equally well to structures. Aha, that's why we used a * when declaring structures! The declaration struct foo_s* x; doesn't quite say that x is a struct foo_s, but rather that x points to a location in memory where the actual structure resides.

    With this understanding, hand-evaluate the main function below. Notice how creating (declaring) a placeholder is separate from creating a vehicle structure.

    /* This is the file ~comp210/Labs/lab11/clone.c
     */
    
    #include <stdio.h>
    
    struct vehicle_s {
      int numWheels, idNum;
      float price;
      };
    
    
    /* clone
     * Takes a vehicle,
     * returns a vehicle which looks just like the original,
     * but is a different copy  (Intensionally different).
     */
    struct vehicle_s*  clone( struct vehicle_s* original ) {
      /* You write this! */
      }
    
    
    int main() {
      struct vehicle_s* trike = new( struct vehicle_s );
      struct vehicle_s* rig;
    
      trike->idNum = 1234;
      trike->numWheels = 3;
      trike->price = 49.99;
    
      rig = clone( trike );
      rig->numWheels = 6 * rig->numWheels;
    
      printf( "%d wheels on trike, and %d on rig.\n",
              trike->numWheels, rig->numWheels ); 
      return 0;
      }
    
    Write a function clone which takes a struct vehicle_s, and returns another struct vehicle_s which is a separate copy of the first. (This is just the struct version of strcpy.) It doesn't require anything fancy, but now it should make more sense exactly what the *s do in the code.

    Hand-evaluate the above code. In particular, pay attention to how rig and trike each get their values. Exactly how does this change, if we went back to saying rig = trike instead of cloning trike?

