Comp 212 - Fall 2003

Lab 14: Finite State Machine and Lexical Analysis


Finite state machines play an important role in the theory and practice of computing.  Lecture #30 shows an application of finite state machine in the design of an infix calculator.  This lab introduces how finite state machines are used as language recognizers. A classic example is the lexical analyzer (or tokenizer) of a compiler. The main task of the lexical analyzer in a compiler is to scan the input text and produce a stream of tokens for the parser to perform syntax analysis.

This lab will require you to review stream IO and parsing discussed in lecture 27lecture 32 and lecture 33.  Work together in groups of two.  Your labbies will discuss your design with you but will not write the code for you.  Feel free to interact with other groups in the lab however.  Feel free to share the code done in the lab.


Tokenizing numbers

Suppose we define a number as a sequence of one or more digits, where digit is one of '0', '1', ...'9'.  We can express such a definition in the following form (called "regular definition").

From the above definition, 0123, 123, 4 are numbers.  Review the definition of the abstract token AToken and its concrete subclasses in lecture 32 and lecture 33.  For all practical purpose, the concrete IntToken class can be used to represent the numbers defined in the above.

One common technique for implementing a lexical analyzer to recognize the above numbers is to realize it as finite state machine with the following state transition diagram.

Exercise 1

Use the state design pattern to implement the above finite state machine as a class called Tokenizer   Tokenizer has a Reader to scan the input and should have one public method 

AToken getNextToken() 

that is to return an concrete IntToken if the current input is a number and a NullToken otherwise.

Do not define the states as anonymous inner classes of Tokenizer.  Define the states as external classes instead.  This will require some form of communication between Tokenizer, the context, and its states.

Extending the definition of numbers

Now we extend the definition of a number as a sequence of one or more digits followed by zero or exactly one fractional part, where the fractional part consists of a period followed by one or more digits.  We can express such a definition as the following regular definition.

The notation (.digit+)? means zero or one .digit+.

From the above definition, 123, 4.5 are numbers while 6. is not.   Below is the corresponding state transition diagram.

 

Exercise 2

What do you need to do to Tokenizer and its states in exercise 1 in order to recognize this extended definition of numbers?

Do it!

 


 
 
Last Revised 11/30/2003 by Dung "Zung" Nguyen