Comp 212 - Lab 12: Java Stream I/O - Recursive Descent Parsing


Introduction

In Java, input and output is defined in terms of an abstract concept called "stream".  A stream is a sequence of data.  If it is an input stream, it has source.  If it is an output stream, it has a destination.  There are two kinds of streams: byte streams and character streams.  The java.io package provides a large number of classes to perform stream I/O.  Mastering Java stream I/O seems like a daunting task.  Do not fret.  In the beginning, you only need to learn to how to manipulate a few I/O classes.  As you progress, you can figure out on your own how to use the other I/O classes and even design new customized I/O classes.  This tutorial describes a few commonly used I/O classes and a class used to parse the input streams called StreamTokenizer.  To illustrate the use StreamTokenizer, the tutorial provides you with a few code examples, and leads you to the process of developing a recursive descent parser to parse scheme-like arithmetic expressions and build the corresponding abstract syntax trees.


The InputStream and OutputStream classes

Input and output in Java of byte (i.e. binary) streams are handled through subclasses of the java.io.InputStream and java.io.OutputStream classes.  These classes are abstractions that Java provides for dealing with reading and writing information sequentially to/from anything you want, be it a disk, a string buffer, an enumeration, or an array of bytes.  We will not spend time on these classes in this lab.  Click on this link to see more details on  InputStream and OutputStream


The Reader and Writer classes

Input and output for characters (i.e. ASCII) streams are handled through subclasses of the java.io.Reader and java.io.Writer classes.  These classes are abstractions that Java provides for dealing with reading and writing character data.  Below is the UML diagram for a few commonly used Reader classes.

reader.png (23099 bytes)

And here is a UML diagram for common Writer classes.

writer.png (24678 bytes)

 

The section on parsing below will illustrate the use of Reader and Writer streams to read a text file containing a Scheme like arithmetic expression, parse it, and build the corresponding abstract syntax tree.


StreamTokenizer and Parsing

StreamTokenizer

Many times when reading an input stream of characters, we need to parse it to see if what we are reading is a word or a number.  This process is called "tokenizing".  java.io.StreamTokenizer is a class that can do simple tokenization.  TestStreamTokenizer.java is a sample program showing how to use StreamTokenizer.  Copy this file and the file input.txt to your local directory and test it out.

Abstract Syntax Tree

To further illustrate the use of  text file IO and StreamTokenizer to parse an input file, we write a program to read in scheme-like arithmetic expressions in pre-fix format, such as (* 3  (+ 4  5)), and build the corresponding abstract syntax tree  (AST):

    *
  /    \
3      +
      /     \
    4        5

Once the AST is built, one can then traverse and evaluate it, as done in Homework #3 (download the solution)!  We shall restrict ourselves to arithmetic expressions containing integers, strings (representing variables), and binary operations such as  addition and multiplication.  The syntax of the Scheme-like arithmetic expression can be expressed as a "grammar" with the following rules.

Scheme Arithmetic Expression Grammar Rules

Description

E :: Leaf | BinOp An expression E  is either a Leaf or a BinOp.
Leaf :: VarLeaf | IntLeaf A Leaf is either a VarLeaf or an IntLeaf
VarLeaf :: id A VarLeaf is an id, where id is a token representing a variable.
IntLeaf :: int An IntLeaf is an int, where int is a token representing an integer.
BinOp :: AddOp | MulOp A BinOp is either an AddOp or a MulOp
AddOp :: ( + E E ) An AddOp is a left parenthesis token,  followed by a plus token, followed by an expression E, followed by another expression E, followed by a right parenthesis token.
MulOp :: ( * E E ) An MulOp is a left parenthesis token,  followed by a star token, followed by an expression E, followed by another expression E, followed by a right parenthesis token.

 

An example for such an expression is: (* 34 (+ xyz 7)), where

In the above grammar rules, 

Recursive Descent Parsing

The manufacturing of an abstract syntax tree (AST) for the above grammar can be thought of a factory method, makeAST(), of some abstract factory, IASTFactory.  How the AST is created is a variant as there are many ways to parse the input stream.  We shall implement a special parsing technique called "recursive descent parsing" (RDP).

In RDP, we use a "tokenizer" to scan and "tokenize" the input from left to right, and build the AST from the top down, based on the value of the tokens.  Building an AST from the top down means building it by first looking at the grammar rule for the start symbol E and try to create an AST that represent E.

As we look at the grammar rule for the start symbol E in the above, we can see that in order for RDP to make an AST, it needs to know how to make a Leaf AST and a BinOp AST.

Here is some pseudo-code.

makeAST():

makeLeaf():

makeIntLeaf(): create the IntLeaf directly.

makeVarLeaf(): create the VarLeaf directly

makeBinOp():

makeAddOp():

makeMulOp(): analogous to makeAddOp().

In short, a RDP

Usually, parsing only involves checking for the syntax of an input expression.  However, in our case, we decide to build the abstract syntax tree directly as we parse.

public class RdpASTFactory implements IASTFactory {
    private AToken _currentTok;
    private ITokenizer _tokenizer:
    public RdpASTFactory(ITokenizer tokenizer) { _tokenizer = tokenizer;}
    public IAST makeAST() { // etc...}
    protected IAST makeLeaf() { // etc...}
    // other (protected)factory methods, etc.
}

Tokens and Tokenizer

The above pseudo-code suggests designing a union of token objects coupled with visitors, and an abstract tokenizer class that knows how to get the next token.  One way to implement this tokenizer class is to wrap a StreamTokenizer object.

public abstract class AToken {
    protected String _lexeme;
    protected AToken(String lex) { _lexeme = lex; }
    public String getLexeme() { return_lexeme;}
    public abstract Object execute(ATokVisitor v, Object inp);
}

public class IdToken extends AToken{
    // constructor, etc.
    public Object execute(ATokVisitor v, Object inp) { 
        return v.idCase(this, inp);)
    }
}

public abstract class ATokVisitor {
    /**
    * This is the case when the host is a concrete IdToken.
    * Throws an IllegalArgumentException as default behavior.
    * @param host a concrete IdToken.
    * @param inp the input needed for this ATokVisitor to perform its task.
    */
    public Object idCase(IdToken host, Object inp) {
        throw new IllegalArgumentException("Wrong token!");
    }
    // etc..., a method for each concrete token case.
}

interface ITokenizer (
   public AToken getNextToken();
}

public class StreamTokWrapper implements ITokenizer {
    private StreamTokenizer _st;
    // constructor and methods, etc...
}

Lab Activities

Divide the lab into 4 groups.

To test the code, download the solution to homework #3, unzip it, and move the various java files into their appropriate subdirectories to match the packaging.

 


Command-line arguments

Sometimes when running large programs you want to be able to set options about how your program should behave at runtime.  For instance, you might have a hierarchy of debugging statements that you want to suppress at run-time, or you might have a variety of features that you want your users to have access to (check out the manual page for the Unix command "ls"--there's a vast amount of features regarding how to display your data, what to display, ad nauseam).  You can handle this in several ways, but most people make use of command-line arguments to do it.

Remember how all your main functions for your programs have to have the same signature?

        public static void main(String[] argv) { ... }

The array of Strings is where you keep your command-line arguments.  Everything after the class name is included in argv. Let's say we wanted to write a program that just echoed back the command-line arguments to stdout. Here's how we'd do it:

    public class echo {

        public static void main(String[] argv) {
            for(int j = 0; j < argv.length; j++)
                System.out.print(argv[j] + " ");  // don't insert an end-line character yet
            System.out.println();  // now print an end-line ('\n')
        }
    }
 


The System class

The java.lang.System class is a useful class containing useful static members doing useful things.  This should look familiar:

    System.out.println("Hello world!");

I'm sure you've done calls like this a million times already, but what's really going on?  out is a static member of the System class; it's a PrintStream object.  A PrintStream is a grandchild of OutputStream in the Java class hierarchy--it has methods implemented that print lines of text at a time as opposed to each character at a time.  System.out is initialized when a program starts to what is known as standard output (stdout).  Stdout is usually the monitor screen, but you can also send stdout to a file at runtime by redirecting it from the Unix command line.  For example, to send the stdout to file "outfile.txt", we do the following:

    % java MyClass > outfile.txt

There is also a System.in class popularly known as standard input (stdin).  Stdin is an InputStream, initially set to taking input from the keyboard, but it can also read from a file at runtime like this:

    % java MyClass < infile.txt

There's a third System i/o file called standard error (stderr).  System.err is another PrintStream designed to direct error messages in case you don't want your output and error messages going to the same place.  Stderr is also initialized to the monitor, but you can redirect it like this:

    % java MyClass >& errfile.txt

And you can combine the redirections:

    % java MyClass < infile.txt > outfile.txt >& errfile.txt

There's lots of other stuff the System class provides for you--check it out.

NOTE: System.out and System.err are the ONLY PrintStream object you should use.  Class PrintStream is deprecated.  To output character streams, you should use PrintWriter (shown in a UML diagram above and in  TestStreamTokenizer.java) instead.

So what would you do if you wanted to, oh, I don't know, read in an arbitrarily long text file of floating point numbers?



Dung X. Nguyen & Jim O'Donnell
revised 04/07/02