Comp 212 - Lab 11: Recursive Descent Parsing


Introduction

 This tutorial describes an important parsing technique called "recursive descent parsing".  It leads you to the process of developing such a parser to parse scheme-like arithmetic expressions and build the corresponding abstract syntax trees.  It makes use of the StreamTokenizer discussed in lecture 30 and the results from homework #3 on abstract syntax trees.


Abstract Syntax Tree

We shall write a program to read in from a text file scheme-like arithmetic expressions in pre-fix format, such as (* 3  (+ 4  5)), and build the corresponding abstract syntax tree  (AST):

    *
  /    \
3      +
      /     \
    4        5

Once the AST is built, one can then traverse and evaluate it, as done in Homework #3 (download the solution)!  We shall restrict ourselves to arithmetic expressions containing integers, strings (representing variables), and binary operations such as  addition and multiplication.  The syntax of the Scheme-like arithmetic expression can be expressed as a "grammar" with the following rules.

Scheme Arithmetic Expression Grammar Rules

Description

E :: Leaf | BinOp An expression E  is either a Leaf or a BinOp.
Leaf :: VarLeaf | IntLeaf A Leaf is either a VarLeaf or an IntLeaf
VarLeaf :: id A VarLeaf is an id, where id is a token representing a variable.
IntLeaf :: int An IntLeaf is an int, where int is a token representing an integer.
BinOp :: AddOp | MulOp A BinOp is either an AddOp or a MulOp
AddOp :: ( + E E ) An AddOp is a left parenthesis token,  followed by a plus token, followed by an expression E, followed by another expression E, followed by a right parenthesis token.
MulOp :: ( * E E ) An MulOp is a left parenthesis token,  followed by a star token, followed by an expression E, followed by another expression E, followed by a right parenthesis token.

 

An example for such an expression is: (* 34 (+ xyz 7)), where

In the above grammar rules, 

Recursive Descent Parsing

The manufacturing of an abstract syntax tree (AST) for the above grammar can be thought of a factory method, makeAST(), of some abstract factory, IASTFactory.  How the AST is created is a variant as there are many ways to parse the input stream.  We shall implement a special parsing technique called "recursive descent parsing" (RDP).

In RDP, we use a "tokenizer" to scan and "tokenize" the input from left to right, and build the AST from the top down, based on the value of the tokens.  Building an AST from the top down means building it by first looking at the grammar rule for the start symbol E and try to create an AST that represent E.

As we look at the grammar rule for the start symbol E in the above, we can see that in order for RDP to make an AST, it needs to know how to make a Leaf AST and a BinOp AST.

Here is some pseudo-code.

makeAST():

makeLeaf():

makeIntLeaf(): create the IntLeaf directly.

makeVarLeaf(): create the VarLeaf directly

makeBinOp():

makeAddOp():

makeMulOp(): analogous to makeAddOp().

In short, a RDP

Usually, parsing only involves checking for the syntax of an input expression.  However, in our case, we decide to build the abstract syntax tree directly as we parse.

public class RdpASTFactory implements IASTFactory {
    private AToken _currentTok;
    private ITokenizer _tokenizer:
    public RdpASTFactory(ITokenizer tokenizer) { _tokenizer = tokenizer;}
    public IAST makeAST() { // etc...}
    private IAST makeLeaf() { // etc...}
    // other (private) factory methods, etc.
}

Tokens and Tokenizer

The above pseudo-code suggests designing a union of token objects coupled with visitors, and an abstract tokenizer class that knows how to get the next token.  

public abstract class AToken {
    protected String _lexeme;
    protected AToken(String lex) { _lexeme = lex; }
    public String getLexeme() { return_lexeme;}
    public abstract Object execute(ATokVisitor v, Object inp);
}

public class IdToken extends AToken{
    // constructor, etc.
    public Object execute(ATokVisitor v, Object inp) { 
        return v.idCase(this, inp);)
    }
}

public abstract class ATokVisitor {
    /**
    * This is the case when the host is a concrete IdToken.
    * Throws an IllegalArgumentException as default behavior.
    * @param host a concrete IdToken.
    * @param inp the input needed for this ATokVisitor to perform its task.
    */
    public Object idCase(IdToken host, Object inp) {
        throw new IllegalArgumentException("Wrong token!");
    }
    // etc..., a method for each concrete token case.
}

public interface ITokenizer (
   public AToken getNextToken();
}

public class StreamTokWrapper implements ITokenizer {
    private StreamTokenizer _st;
    // constructor and methods, etc...
}

Lab Activities

Divide the lab into 3 groups.

  1. One group builds the tokenizer StreamTokWrapper: play around with TestStreamTokenizer and modify it using a file with a scheme arithmetic expression as input.
  2. One group builds the union of tokens: AToken and its concrete variants.  This should be trivial.  After this group is done, it should split in two subgroups: one would join group 1 and the other would join group 3 in order to communicate their results to the other groups.
  3. One group write the recursive descent parser.  This will involve writing all the needed visitors as anonymous inner classes.

To test the code, download the solution to homework #3, unzip it, and move the various java files into their appropriate subdirectories to match the packaging.

Have fun!



Dung X. Nguyen 
revised 11/11/02