Big picture:

xexpr, xml, and html

Information has structure. For instance, a picture-info might consist of a caption and a filename (ending in ".jpeg" or ".gif"); in turn, a gallery consists of a title and a list of picture-infos. We have seen how to represent this data in scheme programs, but non-programmers also want to represent such information. X-expressions and xml are two different but equivalent ways of doing this. We will provide functions to translate back and forth between these formats.
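
As a reminder of the scheme side, here is a minimal sketch of how the gallery data might be represented with structures. The structure and function names (picture-info, gallery, gallery-captions) are illustrative assumptions, not definitions taken from the assignment.

```racket
#lang racket

;; Hypothetical data definitions; the names are assumptions for illustration.

;; A picture-info is (make-picture-info string string)
(define-struct picture-info (caption filename))

;; A gallery is (make-gallery string list-of-picture-info)
(define-struct gallery (title pictures))

(define concert
  (make-gallery
   "Me at the Britney Spears Concert"
   (list (make-picture-info "Waiting in line for a Pepsi." "pict01.jpg")
         (make-picture-info "Waiting in line for the bathroom." "pict19.jpg"))))

;; gallery-captions : gallery -> list-of-string
;; Collect the caption of every picture in the gallery, in order.
(define (gallery-captions g)
  (map picture-info-caption (gallery-pictures g)))
```

Structures like these are convenient for a programmer, but a non-programmer cannot write them down in a data file, which is where xml and xexprs come in.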

So although your assignment will only process xexprs, the provided functions will let you process data files written in the xml format. Since html is just one type of xml, you will be able to write programs that take in and/or produce web pages. (Details on these functions are forthcoming, though you won't need to use them for the homework.)

For example, here is some information in xml format, as might be written by a non-programmer photo archivist:

<gallery>
  <title>Me at the Britney Spears Concert</title>
  <picture>
    <filename>pict01.jpg</filename>
    <caption>Waiting in line for a Pepsi.</caption>
    </picture>
  <picture>
    <filename>pict07.jpg</filename>
    <caption>Waiting in line for <em>another</em> Pepsi.</caption>
    </picture>
  <picture>
    <caption>Waiting in line for the bathroom.</caption>
    <filename>pict19.jpg</filename>
    </picture>
  </gallery>

A few notes on this:

The structure of the data is indicated by matching start and end tags (such as <picture>...</picture>).
The word "another" is inside an em tag; we happen to intend this tag to indicate "emphasized text".
In a sense, these tags are just made-up. You might be wondering, what are the legal tags allowed inside "gallery"? Well, I haven't precisely said, and will leave things to your intuition. (For example, while a picture must have a caption and a filename, a caption does not need to contain an emphasized element.)
Aside: In general, if you want other people to use and understand an xml format you've made up, you should formally specify what tags are necessary, optional, etc. There is a special syntax for doing this, called a "DTD" -- a document type definition, which plays the same role as a data definition (sound familiar?). (Annoyingly, this special syntax could've itself used XML, but the designers failed to do this, and so DTD details are yet another language to learn...)
We won't touch DTDs for this class.
Observe that within a picture, the caption and filename might occur in either order.
The white space (blanks, tabs, returns) has no meaning (beyond making it easy for humans to scan).

This xml is all well and good, but we want to process this information in scheme. We'll provide a function which will translate the above into a corresponding X-expression:

(list 'gallery
      (list 'title "Me at the Britney Spears Concert")
      (list 'picture
            (list 'filename "pict01.jpg")
            (list 'caption "Waiting in line for a Pepsi."))
      (list 'picture
            (list 'filename "pict07.jpg")
            (list 'caption "Waiting in line for " (list 'em "another") " Pepsi."))
      (list 'picture
            (list 'caption "Waiting in line for the bathroom.")
            (list 'filename "pict19.jpg")))


;; Or, equivalently, using the strictly-optional quoted-list form,
;; as mentioned in lecture on Friday:
;;
'(gallery (title "Me at the Britney Spears Concert")
          (picture (filename "pict01.jpg")
                   (caption "Waiting in line for a Pepsi."))
          (picture (filename "pict07.jpg")
                   (caption "Waiting in line for " (em "another") " Pepsi."))
          (picture (caption "Waiting in line for the bathroom.")
                   (filename "pict19.jpg")))
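
To make two of the points above concrete, here is a small sketch in Racket. It checks that the list form and the quoted form build the same value, and it looks up a picture's filename no matter whether it comes before or after the caption. The helper names child-with-tag and picture-filename are assumptions made up for this illustration; they are not part of the provided library.

```racket
#lang racket

(define with-list
  (list 'picture
        (list 'caption "Waiting in line for the bathroom.")
        (list 'filename "pict19.jpg")))

(define with-quote
  '(picture (caption "Waiting in line for the bathroom.")
            (filename "pict19.jpg")))

;; The two notations build exactly the same list.
(equal? with-list with-quote)

;; child-with-tag : symbol xexpr -> xexpr or #f
;; Return the first child element of x whose tag is `tag`, or #f if none.
;; (Hypothetical helper.)
(define (child-with-tag tag x)
  (findf (lambda (child) (and (pair? child) (eq? (first child) tag)))
         (rest x)))

;; picture-filename : picture-xexpr -> string
;; Assumes the picture really does contain a filename element.
(define (picture-filename pic)
  (second (child-with-tag 'filename pic)))

(picture-filename with-quote)
(picture-filename '(picture (filename "pict01.jpg")
                            (caption "Waiting in line for a Pepsi.")))
```

Because the lookup searches all the children for the right tag rather than relying on position, the order of caption and filename does not matter.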

Note the similarity between xml and scheme lists: In scheme there are only three types of parens: round (,) and square [,] and squirrely {,}; these can all nest, and each open must match its close. (You can use these parens interchangeably.)

In xml, it's the same, except there are lots of types of parens (each taking more than a single character to write): there are gallery-parens <gallery>,</gallery> and em-parens <em>,</em>, etc. As you'd expect, these parens can nest, and each open must match its close.

Xexpr is just a convention for representing this plethora of parentheses inside scheme. We pretend that we have all these kinds of parentheses by insisting that each list begin with a symbol naming the xml parenthesis (its "tag"). We don't repeat that symbol at the closing paren, since it's implicit.
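
The correspondence can be sketched as a short function that writes an xexpr back out with its xml "parens". This is only an illustration under simplifying assumptions: the name xexpr->xml-string is made up here (it is not the provided translation function), attributes are not handled, and special characters such as < and & inside strings are not escaped.

```racket
#lang racket

;; An xexpr (for this sketch) is either
;;   - a string, or
;;   - a list whose first item is a symbol (the tag)
;;     and whose remaining items are xexprs.

;; xexpr->xml-string : xexpr -> string
;; Hypothetical helper: the leading symbol becomes both the opening
;; tag and the (implicit) closing tag.
(define (xexpr->xml-string x)
  (cond
    [(string? x) x]
    [else
     (let ([tag (symbol->string (first x))])
       (string-append "<" tag ">"
                      (apply string-append (map xexpr->xml-string (rest x)))
                      "</" tag ">"))]))

(xexpr->xml-string
 '(caption "Waiting in line for " (em "another") " Pepsi."))
;; produces "<caption>Waiting in line for <em>another</em> Pepsi.</caption>"
```

Note how the single symbol at the head of each list is written out twice, once at each end of the element.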

Further examples of data.

A document is also structured info, of course: it consists of paragraphs; paragraphs are a list of words, links, and emphasized sections. Furthermore, links themselves are words (that can be clicked on). You can express all these in xml, by declaring certain tags to represent paragraphs, lists, list-items, etc. This is exactly what html is: a set of xml tags, used to represent the structure of text documents. The name even comes from "hypertext markup language". For example, here is some xml which also happens to be html:

html source:

<p>
  This sentence is a list of words, <em>some of which are meant to be emphasized!</em>
  Even <em>sub</em>parts of words can be emphasized.
  We indicate what to emphasize by placing it between matching "em" tags.
  </p>

<p>
  Each paragraph is delimited between a "p" tag and its matching closing tag, "/p", as you see.
  Blank lines don't count.
  Thought of as an xexpr, it becomes clear this whole thing is a list of (four) (paragraph) xexprs;
  each paragraph itself contains a list of xexprs...
  </p>

<p>
  An ordered list, "ol", is also structured data: a sequence of list-items.
  <ol>
    <li> this is the first item. </li>
    <li> this is the second. </li>
    <li> note that the actual numbering of the items is not included; that's implicit. </li>
    </ol>
  </p>

How your browser interprets this info:

This sentence is a list of words, some of which are meant to be emphasized! Even subparts of words can be emphasized. We indicate what to emphasize by placing it between matching "em" tags.

Each paragraph is delimited between a "p" tag and its matching closing tag, "/p", as you see. Blank lines don't count. Thought of as an xexpr, it becomes clear this whole thing is a list of (four) (paragraph) xexprs; each paragraph itself contains a list of xexprs...

An ordered list, "ol", is also structured data: a sequence of list-items.

  1. this is the first item.
  2. this is the second.
  3. note that the actual numbering of the items is not included; that's implicit.

Of course, there are many other tags used in html. (Try the "view-source" option in your browser, to see the xml for this document itself!)
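
The ordered-list example can also be explored as an xexpr. The sketch below, in Racket, counts the list-items and produces the numbering that the html source leaves implicit, roughly as a browser would when displaying it. The function names (xexpr-text, count-tag, number-items) are assumptions chosen for this illustration.

```racket
#lang racket

(define a-paragraph
  '(p "An ordered list, \"ol\", is also structured data: a sequence of list-items."
      (ol (li "this is the first item.")
          (li "this is the " (em "second") ".")
          (li "the numbering is implicit."))))

;; xexpr-text : xexpr -> string
;; Glue together all the strings inside x, ignoring the tags.
(define (xexpr-text x)
  (if (string? x)
      x
      (apply string-append (map xexpr-text (rest x)))))

;; count-tag : symbol xexpr -> number
;; Count the elements tagged `tag` anywhere inside x (including x itself).
(define (count-tag tag x)
  (cond
    [(string? x) 0]
    [else (+ (if (eq? (first x) tag) 1 0)
             (apply + (map (lambda (child) (count-tag tag child)) (rest x))))]))

;; number-items : ol-xexpr -> list-of-string
;; Attach the implicit numbers 1, 2, 3, ... to the text of each list-item.
(define (number-items ol)
  (for/list ([item (rest ol)]
             [n (in-naturals 1)])
    (string-append (number->string n) ". " (xexpr-text item))))

(count-tag 'li a-paragraph)
(number-items (third a-paragraph))
```

Both functions follow the shape of the data: a string is handled directly, and a tagged list is handled by recurring on everything after the tag.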

Note, for those who know some html: our data definition for xexpr doesn't include attributes. If you want, you can do this whole assignment with a slightly different data definition: a tagged-xexpr is (instead): (cons symbol (cons attr-list xlist)), where the second item, the attr-list, is a list of entries, where each entry is (list symbol string). The provided library, which will let you read/write xexprs as xml-files, will contain a flag, letting you say whether you are imitating attributes as 'attr elements (as in the regular assignment), or if you are using this real version of xexprs.
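
For the curious, here is a sketch of what the attribute-carrying data definition looks like in practice. The example element (an html link with an href attribute) and the accessor names are illustrative assumptions, not part of the provided library. Under this definition every tagged element carries an attr-list, so an element without attributes has an empty one.

```racket
#lang racket

;; A tagged-xexpr is (cons symbol (cons attr-list xlist))
;; An attr-list is a list of entries, each (list symbol string)

(define a-link
  (cons 'a
        (cons (list (list 'href "pict01.jpg"))
              (list "Waiting in line for a " (list 'em '() "Pepsi") "."))))

;; The same value, written with quote:
(define a-link-quoted
  '(a ((href "pict01.jpg")) "Waiting in line for a " (em () "Pepsi") "."))

;; Hypothetical accessors for the three parts of a tagged-xexpr.
(define (xexpr-tag x) (first x))
(define (xexpr-attrs x) (second x))
(define (xexpr-children x) (rest (rest x)))

;; attr-value : symbol tagged-xexpr -> string or #f
;; Look up the attribute called `name`; #f if x has no such attribute.
(define (attr-value name x)
  (let ([entry (assq name (xexpr-attrs x))])
    (if entry (second entry) #f)))

(equal? a-link a-link-quoted)
(attr-value 'href a-link)
(attr-value 'title a-link)
```

Since the attr-list always sits in the second position, functions over this version must skip two items (the tag and the attr-list) before reaching the children.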

(Further aside, on attributes vs. elements: Note that both approaches convey the same information; when designing an xml language, it is sometimes a toss-up whether to include something as an attribute, or an element. The only guideline is that attributes are only simple strings -- not xexprs that can contain further elements.)