Assignment 17 - Finite Automaton

Getting Started.

Please refer to this page for information on how to work on your assignment.

Outline

A finite automaton is an algorithm with a finite memory which scans through an input string or text. The finite memory is given by states, one of them being a starting state and several accepting states. Furthermore, it has an update function which computes from the old state and the current datum the new state. For example, the following automaton checks whether in a string there is an even number of vowels.

      State-Names: Even, Odd;
      Starting State: Even;
      Accepting States: Even;
      Rejecting States: Odd; 
      Alphabet: Lower case letters;
      Update Function:
        Old State    New State at "a","e","i","o","u"; at other letters  
          Even         Odd                               Even
          Odd          Even                              Odd

      So New State ("a",Even) is Odd,
         New State ("b",Even) is Even,
         New State ("c",Odd)  is Odd.

      Sample Runs:   Input seen so far        State of Automaton

        (1)            -                        Even
                       w                        Even
                       wh                       Even
                       who                      Odd
                       whos                     Odd
                       whose                    Even

         (2)           -                        Even
                       w                        Even
                       wh                       Even
                       wha                      Odd
                       what                     Odd

Example (1) accepts and Example (2) rejects the input as after processing the complete word whose the automaton is in the accepting state Even and after processing the complete word what the automaton is in the rejecting state Odd. The type of finite automata dealt with here is called deterministic since the function to compute the new state permits always only one unique value; otherwise the automaton would be called non-deterministic. For more information on finite automata, see Wikipedia and Panet Math.

The task of this exercise

The automaton of this exercise does the following: it scans through a text and checks whether this text has the following minimum qualities:
1. It consists of one or several sentences.
2. Each sentence ends with a dot ., an exclamation mark ! or a question mark ?.
3. Inside a sentence, a colon : or a comma , can occur.
4. Each sentence starts with a word which has the first character being an upper case letter.
5. Spaces can occur after but not before sentence signs.
6. A word can start with any letter followed by arbitrary many lower case letters.
7. After a word there is a space or one of the characters mentionned in points 2 and 3.
8. Numbers can occur inside the sentence and every number is either a zero 0 or one of the digits from 1 to 9 followed by arbitrarily many further digits.
9. There should be no spaces at the beginning of the sentence and no double spaces at all.
The task is to extend these rules as follows and to adapt the below syntax checker accordingly.
1. Semicolons ; should be permitted and treated in the same way as a comma or colon.
2. Plus + and minus - signs might precede a number, but not a zero.
3. Hyphens - are permitted, but only inside a word, so there should be at least one letter before and at least one letter after the hyphen. Note: minus and hyphen are represented by the same symbol.
4. Quotations are marked by single quotes ' like 'this is a quotation' and a quotation starts with a quote symbol after a space and finishes with another quote symbol either before a space or a sentence sign.
5. Inside a word or at the end of a word, there can be an omission symbol, written as a single quote, as in the sample sentence His brothers' cars are green and he's going to his sister's house. which should not be confused with quotes.
The automaton which does the syntax checking has the following states which are listed together with a short intuitive explanation of their meaning:
```
  "e": error - this state is taken when an error is found;
  "f": automaton is going to read the first character of sentence;
  "c": automaton has seen a comma or similar sign ",", ":";
  "d": automaton has seen a dot or similar sign ".", "!", "?";
  "n": automaton is processing or has been processing a number;
  "z": automaton has seen a single zero;
  "s": automaton has seen a space;
  "h": automaton has seen a hyphen "-";
  "p": automaton has seen a plus "+" or a minus "-";
  "q": automaton is between two quotes of a quotation;
  "r": automaton has seen the second quote of a quotation;
  "w": automaton is processing or has been processing a word;
```
The newstate(oldstate, symbol) function computes how to update the state of the automaton when it is processing a symbol. The only thing which needs to be adapted is this function. Here some explanation on an example how this function works. Consider the following sample run:
```
  My number is 007.
  wwswwwwwwswwszeee
  The sentence is syntactically incorrect. 
```
In this example, the sentencecheck() function is already doing what it should as in the specification above, it is said that no number except 0 itself should state with the symbol 0, so James Bond would have to write his number as "0 0 7" in order to pass the syntax checking. The sample run is a short representations for the following calls of the newstate() function which always takes the old value of the state and computes from it and the symbol above the new value of the state. The initial value "f" is dropped in above representation, but the second line gives the states after having processed the corresponding part of the word above it, so "wwsw" are the states taken while processing "My n" with "w" being the state after these four symbols have all been processed. A longer representation of this expression would be the following one:
```
  "w" = newstate("f","M");
  "w" = newstate("w","y");
  "s" = newstate("w"," ");
  "w" = newstate("s","n");
  "w" = newstate("w","u");
  "w" = newstate("w","m");
  "w" = newstate("w","b");
  "w" = newstate("w","e");
  "w" = newstate("w","r");
  "s" = newstate("w"," ");
  "w" = newstate("s","i");
  "w" = newstate("w","s");
  "s" = newstate("w"," ");
  "z" = newstate("s","0");
  "e" = newstate("z","0");
  "e" = newstate("e","7");
  "e" = newstate("e",".");
```
This input is rejected, that is, classified as false by the finite automaton. Indeed, it would reject any input where the machine is not in the state "d" after having processed all the symbols of the input. The most straightforward implementation of the newstate() function would be a large table which tells the function what value to assign for which state and which input, this table would be finite as there are only finitely many states and finitely many symbols. Somehow, the table would be extremely large and difficult to typeset, therefore it is much more convenient to write a JavaScript function instead as it is done in this exercise.

The function has a variable rv which contains the return value of the function which is passed back as the new state after it finishes. The initial value of rv is "e", so that the error-state is taken in the case that the function does not provide an explicit case for what to do with the old state and symbol in the input. The switch command in the function has entries for all states which need to be addressed, for example in the case of the state "f", the entry looks like this:
```
case "f":   if(("A" <= symbol) && ("Z" >= symbol))
            {
                rv = "w";
            }
            break;
```
So if the symbol is one of the upper case letters A, B, … , Z, the rv variable which has the next state takes the value "w" for indicating that a word is processed. Indeed, as a sentence should start with a word having an upper case letter at the beginning, this is exactly what the program should do. Other options should make the automaton to take the error state "e" what indeed happens whenever it is not clearly overwritten. The break command indicates that the case dealing with "f" is completed. Here another example, namely the one inside a word:
```
case "w":   if(("a" <= symbol) && ("z" >= symbol))
            {
                rv = "w";
            }
            if(symbol == " ")
            {
                rv = "s";
            }
            if((symbol == ".") || (symbol == "?") || (symbol == "!"))
            {
                rv = "d";
            }
            if((symbol == ",") || (symbol == ":"))
            {
                rv = "c";
            }
            break;
```
Here the update rule is more complicated. There are four if statements, each dealing with another transfer of the state. If the symbol is one of the letters a, b, … , z then the next state is again "w". If the symbol is a space then the next state is "s". If the symbol is ., ? or ! then the next state is "d" which stands for dot-like sentence signs. If the symbol is , or : then the next state is "c" which stands for comma-like sentence signs. If none of these cases apply, the return value will have the default "e" and the finite automaton will go into the error state.

In order to do the requested changes, one has to introduce such type of environments for the states "h" and "p". Note that "p" is entered from "s" when there is a + or - after a space. From "p" there should follow one of the digits 1 to 9 but not the digit 0. So "-325" and "+7711" are permitted but neither "+007", "-0" nor "+ 88". The hyphen state "h" should only be reached from "w" if there is an hyphen instead of a letter and after the hyphen, there should again be one of the letters a, b, … , z. If not, the automaton should take the state "e". Note that it depends from the context whether the symbol - is a minus or a hyphen.

Furthermore, at several places, ; has to be added into the list of permitted symbols for a state change. Also the quote symbol ' should be handled adequately. Only single quotes occur. If they are after a space, they indicate the beginning of a quotation and the automaton should take the state "q". What follows is already there. If they occur inside a word, they are not the beginning of a quotation but just an omission symbol as in the sentence He's running home fast to meet his sister's boyfriend. Therefore, quote symbols inside a word should be treated like letters. For the ease of programming, it can be assumed that the first symbol of a sentence is not a quote symbol and that the above two cases are all which occur. Quote symbol denoting omissions or genitives can be treated like lower-case letters and do not need an extra state of the automaton.

JavaScript Starts Here.

JavaScript Ends Here.