Configurable Parser

A great advantage of the configurable parser is that it can be configured at run time. This makes the process of adding new operators easier and allows other grammatical rules to be included.

Basic architecture

The parser has two main parts: a tokenizer and a grammar analyzer.

The tokenizer breaks the input into a series of tokens, and the grammar analyzer reads these tokens and turns them into a tree of nodes. Each type of token is recognized by a TokenMatcher, and new TokenMatchers can be added at run time to allow a configurable syntax. There are Tokens and corresponding TokenMatchers for numbers, variables, strings, functions, operators, white space and comments.

After tokenizing, a filtering step is performed on the list of tokens. This is mainly used to remove white space and comments, although other operations on the list could be performed.
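The tokenize-then-filter pipeline can be sketched in a few lines of language-neutral Python. This is an illustration only, not the Jep API: the matcher set, token kinds and function names here are assumptions chosen for the example.

```python
import re

# Each matcher tries to match a token at the start of the remaining input;
# the first matcher that succeeds wins (order matters).
MATCHERS = [
    ("number", re.compile(r"\d+(\.\d+)?")),
    ("identifier", re.compile(r"[A-Za-z_]\w*")),
    ("operator", re.compile(r"[+\-*/]")),
    ("whitespace", re.compile(r"\s+")),
]

def tokenize(s):
    """Break the input into a list of (kind, text) tokens."""
    tokens = []
    while s:
        for kind, pat in MATCHERS:
            m = pat.match(s)
            if m:
                tokens.append((kind, m.group()))
                s = s[m.end():]
                break
        else:
            raise ValueError("no matcher for: " + s)
    return tokens

def filter_tokens(tokens):
    """The filtering step: drop white space (and similarly comments)."""
    return [t for t in tokens if t[0] != "whitespace"]
```

For example, `filter_tokens(tokenize("2 * 3"))` yields only the number and operator tokens, which is what the grammar analyzer then consumes.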

The final stage is to interpret the tokens and build a tree of nodes. This stage uses the precedence rules of operators so that the expression 2*3+4 is correctly interpreted as (2*3)+4 and not 2*(3+4). The core of the algorithm is an operator-precedence parser using the shunting yard algorithm; most of the grammar rules are constructed automatically from the operators specified in the OperatorTable. Additional grammar rules can be specified by adding GrammarMatchers to the parser. Such additional rules are used to specify the syntax for functions and lists.
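The precedence handling can be illustrated with a minimal shunting-yard sketch (Python, illustrative only; Jep's parser builds INode trees, not the nested tuples used here):

```python
# Precedence table: higher numbers bind tighter.
PREC = {"+": 1, "-": 1, "*": 2, "/": 2}

def parse(tokens):
    """Shunting-yard sketch: token list -> nested (op, left, right) tuples."""
    out, ops = [], []

    def reduce_top():
        # pop one operator and build a node from the top two operands
        op = ops.pop()
        r, l = out.pop(), out.pop()
        out.append((op, l, r))

    for t in tokens:
        if t in PREC:
            # reduce while the stacked operator binds at least as tightly
            while ops and PREC[ops[-1]] >= PREC[t]:
                reduce_top()
            ops.append(t)
        else:
            out.append(t)
    while ops:
        reduce_top()
    return out[0]
```

With this sketch, `parse(["2", "*", "3", "+", "4"])` produces `("+", ("*", "2", "3"), "4")`, i.e. (2*3)+4, as described above.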

Adding and changing the tokenizer stage

To allow new lexical elements, a new TokenMatcher should be added to the list of token matchers used by the parser. A number of TokenMatchers are predefined; see the Matchers namespace for a list of these.

To create a new TokenMatcher, create a class implementing the ITokenMatcher interface, typically by subclassing one of the existing TokenMatchers.

namespace SingularSys.Jep.ConfigurableParser.Matchers
{
  /// Interface defining classes which match tokens
  public interface ITokenMatcher {
    /// Attempts to match the start of the string.
    /// Note: new objects should be created each time, as error 
    /// reporting information is later attached to tokens.
    Token Match(String s);
	
    /// Initialize the matcher when the Jep instance is known. 
    void Init(JepInstance j);
  }
}
	
In general the Match method should return one of the predefined tokens listed in the Tokens namespace, although other token types can be used if there is a corresponding GrammarMatcher.

Once created, the TokenMatcher needs to be added to the list of matchers used by the parser. The order is important, as each matcher is tried in turn and some input will match more than one matcher. Typically the full set of matchers will need to be added in the correct order. See the example below.
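Why the order matters can be shown with a tiny first-match-wins sketch (plain Python, not the Jep API): a matcher for `<` placed before one for `<=` would prevent `<=` from ever matching.

```python
def first_match(matchers, s):
    """Return the first literal symbol in `matchers` that prefixes `s`,
    or None. Mirrors trying each TokenMatcher in turn."""
    for symbol in matchers:
        if s.startswith(symbol):
            return symbol
    return None

# Wrong order: "<" shadows "<=", so "<=" can never match.
# first_match(["<", "<="], "<= 3")  -> "<"
# Right order: longer symbols first.
# first_match(["<=", "<"], "<= 3")  -> "<="
```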

Adding new operators

Most changes to the syntax will simply consist of adding new operators or changing the symbol of existing operators. A simple example would be:

OperatorTable ot = jep.OpTab;
// create a round operator
Operator op = new Operator("~", new Round(), 
                           Operator.UNARY + Operator.PREFIX);
// add it with the same precedence as not
ot.AddOperator(ot.NumOps, op, ot.GetNot());
// informs the parser and other components about the new operator
jep.ReinitializeComponents();

Once added, the new operator is ready to be used in the parser. For more details on adding operators, see the Operators manual page.

Adding a GrammarMatcher

New grammatical rules can be implemented by creating a class implementing the IGrammarMatcher interface.

namespace SingularSys.Jep.ConfigurableParser.Matchers
{
  /// Interface defining matchers for custom grammatical elements.
  /// GrammarMatchers match syntax elements at the same precedence level as brackets.
  /// They can examine the next two tokens in the input using the Lookahead2Enumerator
  /// and call IGrammarParser.ParseSubExpression() to parse expression fragments.
  public interface IGrammarMatcher {
	
    /// Test whether the input matches this pattern.
    /// param name="it": An iterator inspecting the input
    /// param name="parser": the parser to use when evaluating sub expressions
    /// returns: if matched, returns a node representing the content;
    ///          returns null if it does not match
    INode Match(Lookahead2Enumerator it, IGrammarParser parser);
    	
    /// Delayed initialization; this method is called whenever 
    /// components of the Jep instance are changed. 
    /// param name="jep": the current jep instance.
    void Init(JepInstance jep);
  }
}

The Match method can query the next two tokens from the input using it.PeekNext() and it.PeekNextnext(). If these tokens match the rule, the current position of the input should be advanced using it.Consume(). If the rule does not match, the Match method should return null before any call to it.Consume(). Further tokens can be read using a combination of it.PeekNext() and it.Consume().
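The PeekNext/PeekNextnext/Consume pattern can be sketched language-neutrally (Python; a hypothetical stand-in for the Lookahead2Enumerator, not the Jep class itself):

```python
class Lookahead2:
    """Two-token lookahead over a token list, mirroring the
    PeekNext / PeekNextnext / Consume methods described in the text."""

    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.pos = 0

    def peek_next(self):
        # the current token, or None at end of input
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def peek_nextnext(self):
        # one token further ahead, or None
        i = self.pos + 1
        return self.tokens[i] if i < len(self.tokens) else None

    def consume(self):
        # advance past the current token
        self.pos += 1
```

A matcher first peeks (returning None/null if the rule cannot apply) and only then consumes, so a failed match leaves the input position untouched for the next matcher.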

Various methods of the Token class can be used to query the type of a token, for instance Token.IsFunction(), Token.IsIdentifier() and Token.IsNumber(). The Token.Equals(Object o) method can also be used to test a token against a known symbol.

For functions and lists it may be necessary to parse the arguments or list elements. These can be parsed using the INode ParseSubExpression() method of the IGrammarParser interface.

Once the input has been parsed, the resulting node needs to be assembled. Here the NodeFactory methods can be used to construct nodes of the appropriate type. The OperatorTable, FunctionTable, VariableTable and NumberFactory classes can also be used.

SymbolTokens

New syntactical features may require special symbols, for instance the [ and ] used to represent lists. These symbols need to be recognized by the tokenizer stage and used later by the appropriate GrammarMatchers. The SymbolTokenMatcher class can recognize a set of SymbolTokens, which are loaded using its public bool Add(SymbolToken e) method. The same tokens can then be passed to the constructor of a GrammarMatcher, and the token's Equals method used to test whether it is the same token as in the input.

// A matcher for special symbols
SymbolTokenMatcher stm = new SymbolTokenMatcher();
m.Add(stm); // add to the list of matchers

// Create a special symbol and add it to the list 
SymbolToken colon = new SymbolToken(":");
stm.Add(colon);

...
public class MyGrammarMatcher : IGrammarMatcher {
  SymbolToken colon;
  public MyGrammarMatcher(SymbolToken colon) {
    this.colon = colon;
  }
  
  public INode Match(Lookahead2Enumerator it, IGrammarParser parser)
  {
    Token t = it.PeekNext();
    // use this way round to avoid problems when t is null
    if (colon.Equals(t))
      ....
  }
  ...
}

Example grammar matcher

The following code is an example of a GrammarMatcher matching a function:

/// A GrammarMatcher which matches functions in the form 'atan2(y,x)'.
/// The function must be in the FunctionTable and brackets are required.
public class FunctionGrammarMatcher : IGrammarMatcher {
  private Token open; // = new SymbolToken("(");
  private Token close; // = new SymbolToken(")");
  private Token comma; // = ",";
  
  private NodeFactory nf;
  /// Create a FunctionGrammarMatcher
  /// param name="open": token representing an opening bracket
  /// param name="close": token representing a closing bracket
  /// param name="comma": token representing a list item separator
  public FunctionGrammarMatcher(Token open, Token close, Token comma) {
    this.open = open;
    this.close = close;
    this.comma = comma; 
  }
  
  /// Initialize the node factory
  public void Init(JepInstance jep)  {   nf = jep.NodeFac;      }
  
  /// Attempt to match a function, calls the GrammarParser.parseSubExpression() 
  /// to match function arguments.
  public SingularSys.Jep.Parser.INode Match(Lookahead2Enumerator it, 
                                            IGrammarParser parser)  
  {
    Token t = it.PeekNext();
    if(t==null) return null;
    if(!t.IsFunction()) return null;

    String name = t.GetSource();
    IPostfixMathCommand pfmc = ((FunctionToken) t).GetPfmc();
    if(!open.Equals(it.PeekNextnext())) return null;
    it.Consume();
    it.Consume();

    if(close.Equals(it.PeekNext())) {
      if(!pfmc.CheckNumberOfParameters(0))
        throw new ParseException("Function "+pfmc+" invalid number of arguments 0");
      it.Consume();
      return nf.BuildFunctionNode(name, pfmc,new INode[0]);
    }

    List<INode> seq = new List<INode>();
    while(true) {
      INode contents = parser.ParseSubExpression();
      seq.Add(contents);
      if(close.Equals(it.PeekNext())) // this way round to cope with null
        break;
      else if(comma.Equals(it.PeekNext()))
        it.Consume();
      else
        throw new ParseException("Closing bracket not found. Next token is "
                     +it.PeekNext());
    }
    it.Consume();

    if(!pfmc.CheckNumberOfParameters(seq.Count))
      throw new ParseException("Function "+pfmc+
	              " invalid number of arguments "+seq.Count);
    return nf.BuildFunctionNode(name, pfmc,seq.ToArray());
  }
}

Example usage

The following code illustrates how a configurable parser could be initialized.

List<ITokenMatcher> m = new List<ITokenMatcher>();
List<ITokenFilter> filters = new List<ITokenFilter>();
List<IGrammarMatcher> gm = new List<IGrammarMatcher>();

// Add token matchers for comments
m.Add(CommentTokenMatcher.HashCommentMatcher());
m.Add(CommentTokenMatcher.SlashSlashCommentMatcher());
m.Add(CommentTokenMatcher.SlashStarCommentMatcher());
m.Add(CommentTokenMatcher.MultiLineSlashStarCommentMatcher());

// match different types of string, ' and "
m.Add(StringTokenMatcher.DoubleQuoteStringMatcher());
//m.Add(StringTokenMatcher.SingleQuoteStringMatcher());

// match whitespace
m.Add(WhiteSpaceTokenMatcher.DefaultWhiteSpaceTokenMatcher());

// match numbers
m.Add(NumberTokenMatcher.ExponentNumberTokenMatcher());

// match variable or function names
m.Add(IdentifierTokenMatcher.BasicIdentifierMatcher());

// match operators in the OperatorTable
m.Add(new OperatorTokenMatcher());

// A matcher for special symbols
SymbolTokenMatcher stm = new SymbolTokenMatcher();
m.Add(stm);

// Special symbols used in the grammar
SymbolToken ropen = new OpenToken("(");
SymbolToken rclose = new CloseToken(")");
SymbolToken sopen = new OpenToken("[");
SymbolToken sclose = new CloseToken("]");
SymbolToken comma = new SymbolToken(",");

stm.Add(ropen);
stm.Add(rclose);
stm.Add(sopen);
stm.Add(sclose);
stm.Add(comma);

// Equations can be separated by a semi-colon
// This matcher causes the tokenizer to stop parsing when a semi-colon
// is encountered.
m.Add(TerminalTokenMatcher.SemiColonTerminalMatcher());

// filter out comments
filters.Add(new WhiteSpaceCommentFilter());

// Handle brackets, this matcher uses round brackets for arrays (2,3)
gm.Add(new ListOrBracketGrammarMatcher(ropen,rclose,comma));

// match functions like cos(pi/2)
gm.Add(new FunctionGrammarMatcher(ropen,rclose,comma));

// match arrays with square brackets [2,3]
//gm.Add(new ListGrammarMatcher(sopen,sclose,comma));

// match the array access operation v[2]
gm.Add(new ArrayAccessGrammarMatcher(sopen,sclose));

// Construct the Jep instance and set the parser
jep = new JepInstance();
jep.Parser = new ConfigurableParser(m, filters, gm);