2014-05-04
Parsing a Simplified Grammar

Part 1 of Parsing and Evaluating Mathematical Expressions

There is no need to work out everything anew. Much material is available on building parsers, even when looking only at work using Pascal or Delphi. Completely worked-out parsers are available: the JEDI Code Library (JCL) unit JclExprEval.pas, Renate Schaaf’s TExpress version 2.5, TParser version 10.1 by Renate Schaaf, Alin Flaider and Stefan Hoffmeister, or symbolic.pas by Marco van de Voort.

Because it is difficult to follow the logic in such complex units, the approach used here starts with a much simpler parser. It will be able to handle expressions that contain numbers, parentheses ( ) and the four arithmetic operators +, −, * and / with the usual precedence rules:

a) anything included in parentheses is evaluated first,
b) multiplications and divisions are evaluated next and finally
c) additions and subtractions are done last.

Complications such as exponentiation, built-in functions (sin, cos, etc.), predefined constants (e and pi, for example), variables (x and t, for example) and user-defined functions and variables will be added later.

Simple Grammar

An older book, Kernighan and Plauger’s Software Tools in Pascal, contains code for a macro processor with an expression evaluator for the simple grammar loosely defined above. This is approximately how the authors define the syntax of mathematical expressions (p. 298):

	expression :	term | term + term | term − term
	term :		factor | factor * factor | factor / factor
	factor : 	number | ( expression ) | + factor | − factor
The vertical bar | denotes ‘or’.

This grammar in pseudo Backus-Naur Form can also be specified using syntax diagrams, also called railroad diagrams, as found in Jensen and Wirth’s Report on Pascal (1975).

[Syntax (railroad) diagrams for Expression, Term and Factor]

It is now very clear that an expression is a term or a sequence of terms separated by + or − operators. Similarly, a term is a factor or a sequence of factors separated by * or / operators. A factor is a number, optionally preceded by a leading (unary) + or −. A factor can also be an expression enclosed in parentheses. The unary + operator is redundant and thus ignored. The unary − operator is equivalent to multiplication by −1. Thus the factor ‘−32’ is equal to the term −1 * 32.

It remains to define what a number is. It can be an integer, or a real number represented in decimal notation or in floating-point (exponent) notation.

[Syntax diagram for Number]

While this may look complex, the diagram corresponds to the usual definition of a number in programming languages, spreadsheets or even calculators. Digit is any of the characters 0 through 9. Dec.Sym. is the locale-dependent decimal symbol, usually the period or the comma. Here are examples of well-formed numbers, assuming that the decimal symbol is the period ‘.’:

	1.83:	1.83E0	0.183E1	0.0183E2
	183:	1.83E2	18.3E1	183E0
	0.183:	1.83E-1	18.3E-2	183E-3

On the other hand, 1.83E*8 will be deemed a badly formed number; at least one digit must follow the exponent identifier E, so the sequence will be scanned as the invalid number 1.83E, the operator * and the number 8.
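Under the assumption that the decimal symbol is the period, the number diagram can be approximated by a regular expression. The following is a hedged Python sketch (not the article's Pascal tokenizer; `NUMBER_RE` and `scan_number` are illustrative names):

```python
import re

# Digits, an optional decimal part, and an optional exponent that must
# contain at least one digit, as in the syntax diagram above.
NUMBER_RE = re.compile(r'\d+(?:\.\d+)?(?:[Ee][+-]?\d+)?')

def scan_number(source: str, pos: int = 0) -> str:
    """Return the longest well-formed number starting at pos, or '' if none."""
    m = NUMBER_RE.match(source, pos)
    return m.group(0) if m else ''
```

In ‘1.83E*8’ the exponent part has no digit, so this sketch stops the match at ‘1.83’; a tokenizer that instead accumulates the dangling ‘E’, as described above, would report the invalid number ‘1.83E’.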

Recursive Descent Parsing

The above language can be handled quite naturally by what is called a recursive descent parser. The expression 8.9+32*(8-3)/9+52 is broken down into its constituent parts by that type of parser:
[Diagram: recursive-descent breakdown of the expression into expressions, terms, factors and numbers]
Note how eventually everything has to resolve into a number. How else can a numeric expression be calculated? Evaluation is done in reverse order once all numbers have been found. Hence the expression in parenthesis is calculated first. Then multiplications and divisions are performed and additions and subtractions are done last.

Implementation

Each syntax element (expression, term and factor) has a corresponding function. Here is KP’s code for the function expression:

```pascal
function Expression(var src: string; var i: integer): integer;
var
  v: integer;
  t: char;
begin
  v := Term(src, i);
  t := NextChar(src, i);
  while (t in ['+', '-']) do
  begin
    i := i + 1;
    if t = '+' then
      v := v + Term(src, i)
    else if t = '-' then
      v := v - Term(src, i);
    t := NextChar(src, i);
  end;
  Expression := v;
end;
```

Since expressions are sums or differences of terms, the function begins by looking for the first term. The function for a term will begin by finding a factor which in turn may call expression if it finds a left parenthesis. This is the reason that the parser is said to be recursive. Assuming expression finds the first term, it then looks for a ‘+’ or ‘-’ operator. If one is found then a second term is added or subtracted from the first term. The process continues as long as successive terms are separated by a ‘+’ or ‘-’.

Kernighan and Plauger mix parsing and scanning. To make expansion easier, the two will be separated. Here is a version of our simple parser’s Expression function presented in a fashion as close to Kernighan and Plauger’s as possible.

```pascal
function Expression: float;
begin
  result := Term;
  while Tokenizer.TokenType in [ttPlus, ttMinus] do
  begin
    if Tokenizer.TokenType = ttPlus then
    begin
      Tokenizer.NextTrueToken;
      result := result + Term;
    end
    else
    begin // TokenType = ttMinus
      Tokenizer.NextTrueToken;
      result := result - Term;
    end;
  end;
end;
```

Note that the internal variable result, a Delphi extension, is used instead of the explicit v variable. The tokenizer keeps track of its position within the source string, so the parameters src and i do not have to be passed between the functions Expression, Term and Factor. The tokenizer also makes available the type of the current token, so there is no need to save it in a local variable t. It is up to the parser to tell the tokenizer to move on to the next token. The tokenizer’s method NextTrueToken moves to the next token, skipping spaces, tabs and other meaningless characters often called white space. The call to move on to the next token has to be done after the second check of the current token’s type and before the second call to Term.

It is a very simple matter to adapt Expression to get the function Term. The actual function, shown below, uses an endless repeat loop instead of a while loop to avoid an occasional extra logical test.

```pascal
function Term: float;
begin
  result := Factor;
  repeat
    if Tokenizer.TokenType = ttMult then
    begin
      Tokenizer.NextTrueToken;
      result := result * Factor;
    end
    else if Tokenizer.TokenType = ttDiv then
    begin
      Tokenizer.NextTrueToken;
      result := result / Factor;
    end
    else
      exit;
  until false;
end;
```

Because of the recursive nature of the parser, it is impossible to place Expression, Term and Factor in an order where each function is defined before any call to it appears in the code. Pascal’s forward declaration solves this problem. The first line below tells the compiler that Factor’s definition will be provided later on.

```pascal
function Factor: float; forward;

function Term: float;
// uses Factor, to be defined later but already declared
begin
  ...
end;

function Expression: float;
// uses Term, already declared and defined
begin
  ...
end;

function Factor: float;
// uses itself and Expression, already declared and defined
begin
  result := 0;
  if Tokenizer.TokenType = ttNumber then
  begin
    result := StrToNumber(Tokenizer.Token);
    Tokenizer.NextTrueToken;
  end
  else if Tokenizer.TokenType = ttPlus then
  begin
    Tokenizer.NextTrueToken; // skip and ignore leading '+'
    result := Factor;
  end
  else if Tokenizer.TokenType = ttMinus then
  begin
    Tokenizer.NextTrueToken; // skip '-'
    result := -Factor;       // unary -
  end
  else if Tokenizer.TokenType = ttLeftParenthesis then
  begin
    Tokenizer.NextTrueToken; // skip '('
    result := Expression;
    if Tokenizer.TokenType <> ttRightParenthesis then
      RaiseParseException(peExpectedRightPar);
    Tokenizer.NextTrueToken; // skip ')'
  end
  else if Tokenizer.TokenType = ttEOS then
    RaiseParseException(peUnexpectedEOS)
  else
    RaiseParseException(peUnexpectedElement);
end;
```

If Factor encounters a number, its string representation, held in the tokenizer’s Token property, will be evaluated to a floating point value. If the string is not a well-formed number, an exception is raised. If Factor encounters a unary ‘+’ as in the formula +(3−4)*8, this redundant + is simply ignored, and Factor moves on to the next token and the next factor. If Factor encounters a unary ‘−’ as in the formula −(3−4)*8, it looks for a factor starting at the next token and then returns the negative of the value of that factor. If Factor encounters a ‘(’ then it will return the value of an expression starting at the next token. Once the expression has been parsed, Factor checks for the matching right parenthesis; if it is not found, an exception is raised. If the end of the source has been reached, an exception is raised. Finally, if anything else is found, Factor will not know how to handle it and another exception is raised.

This is basically the whole parser. To evaluate a formula, all that needs to be done is to feed the formula to the tokenizer, to prime it by calling its NextTrueToken method and then to call Expression, which will yield the result. To see how it works, let’s look at the parsing of a simple formula: 2+3

```
Expression entered (token = "2", token type = Number)
  Term entered (token = "2", token type = Number)
    Factor entered (token = "2", token type = Number)
    Factor found number (value = 2)
    Factor done, returns value = 2, token = "+" and token type = Plus
  Term done, returns value = 2, token = "+" and token type = Plus
Expression's result = 2
Expression found "+" so will get a second term
  Term entered (token = "3", token type = Number)
    Factor entered (token = "3", token type = Number)
    Factor found number (value = 3, token type = Number)
    Factor done, returns value = 3, token = "" and token type = End
  Term done, returns value = 3, token = "" and token type = End
Expression adds 3 to previous result = 5
Expression finds End (of source)
Expression done, returns value = 5, token = "" and token type = End
```

Summarizing, Expression calls Term which calls Factor to parse and evaluate both numbers, 2 and 3. Expression recognizes the infix operator + and takes care of the addition. Let’s add a multiplication to see how the parser takes care of the ‘multiplication before addition’ precedence rule. The formula to evaluate is 2+3*5.

```
Expression entered (token = "2", token type = Number)
  Term entered (token = "2", token type = Number)
    Factor entered (token = "2", token type = Number)
    Factor found number (value = 2)
    Factor done, returns value = 2, token = "+" and token type = Plus
  Term done, returns value = 2, token = "+" and token type = Plus
Expression's result = 2
Expression found "+" so will get a second term
  Term entered (token = "3", token type = Number)
    Factor entered (token = "3", token type = Number)
    Factor found number (value = 3, token type = Number)
    Factor done, returns value = 3, token = "*" and token type = Mult
  Term's result = 3
  Term found "*" so will get a second factor
    Factor entered (token = "5", token type = Number)
    Factor found number (value = 5)
    Factor done, returns value = 5, token = "" and token type = End
  Term multiplies previous result by 5 = 15
  Term done, returns value = 15, token = "" and token type = End
Expression adds 15 to previous result = 17
Expression finds End (of source)
Expression done, returns value = 17, token = "" and token type = End
```

Expression calls Term to find the second operand after the + operator. But Term finds a * after the number 3, so it knows to get the following operand and performs the multiplication before returning to Expression. So while the addition appears first, it is performed last, as needed. Operator precedence is handled automatically: anything in parentheses is calculated first because it is handled in Factor, and the parser always starts with the sequence Expression → Term → Factor.

That same sequence explains why multiplication is done before addition; the function Term is executed before the function Expression.
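The whole Expression → Term → Factor machinery can be sketched compactly in Python. This is an illustrative translation of the technique, not the article's Delphi source; the tokenization is simplified and all names are my own:

```python
import re

def evaluate(src: str) -> float:
    """Recursive-descent evaluation of +, -, *, / and parentheses."""
    tokens = re.findall(r'\d+(?:\.\d+)?(?:[Ee][+-]?\d+)?|[-+*/()]', src)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else ''

    def advance():
        nonlocal pos
        pos += 1

    def expression():          # sums and differences of terms
        v = term()
        while peek() in ('+', '-'):
            op = peek(); advance()
            v = v + term() if op == '+' else v - term()
        return v

    def term():                # products and quotients of factors
        v = factor()
        while peek() in ('*', '/'):
            op = peek(); advance()
            v = v * factor() if op == '*' else v / factor()
        return v

    def factor():              # number, unary sign, or parenthesized expression
        t = peek()
        if t == '+':
            advance(); return factor()   # redundant unary +
        if t == '-':
            advance(); return -factor()  # unary minus
        if t == '(':
            advance()
            v = expression()
            if peek() != ')':
                raise ValueError('expected right parenthesis')
            advance()
            return v
        if t and t[0].isdigit():
            advance(); return float(t)
        raise ValueError('unexpected token: %r' % t)

    result = expression()
    if peek():
        raise ValueError('unexpected trailing input')
    return result
```

Evaluating `'2+3*5'` yields 17.0: term performs the multiplication before expression performs the addition, giving the usual precedence without any explicit precedence table.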

Tokens

The string containing the mathematical expression to be evaluated is broken up into tokens, the smallest units of meaning understood by the parser. Tokens are to a mathematical expression what words are to a sentence. A tokenizer could perform its task all at once, returning an array of tokens. Instead, the tokenizer can return one token at a time. The parser requests the next token once it has dealt with the current one. It is up to the tokenizer to keep track of its position within the source code. While not mandatory, an object, with its capacity to have fields and its ability to hide some of them, is well suited for handling these tasks.

The tokenizer identifies the type of the tokens it finds; for that reason, some would call the tokenizer a lexer. The syntax of mathematical expressions is such that a token’s type is recognized by looking at its first character. Furthermore, because each token type constrains which characters the token may contain, the tokenizer can separate adjacent tokens even when they are not separated by spaces. Hence the string ‘2*41’ will be broken into three tokens: ‘2’, ‘*’ and ‘41’. Fundamentally there are three kinds of tokens in this simple context:

  1. A white space sequence must begin with a space or a control character (a character with a code lower than that of the space character). The tokenizer will accumulate all contiguous characters with a code less than or equal to 32 and greater than 0 into the white space token.
  2. A number must begin with a digit. A number can be an integer or a real number. The tokenizer will attempt to form the biggest number token possible so that the string ‘1234’ will be interpreted as a single number 1234 and not two numbers 1 and 234 or 12 and 34 or even four numbers 1, 2, 3 and 4.
  3. An operator, which is a one-character token from the following list: ‘-’, ‘+’, ‘*’, ‘/’, ‘(’ and ‘)’.

Anything else will be a one-character token of unknown type and hence illegal.

```pascal
TMdTokenType = (
  ttUnknown,           // unknown token
  ttEOS,               // end of source
  ttWhite,             // white space; ASCII code 1 to 32
  ttNumber,            // integer or real number
  ttMinus,             // -
  ttPlus,              // +
  ttMult,              // *
  ttDiv,               // /
  ttLeftParenthesis,   // (
  ttRightParenthesis   // )
);
```
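For illustration, the three kinds of tokens can be sketched in Python with one regular-expression alternative per kind. This is a hedged sketch: the group names echo the Pascal enumeration, but the code is not from the article's source.

```python
import re

# One named alternative per token kind; anything else is Unknown.
TOKEN_RE = re.compile(
    r'(?P<White>[\x01-\x20]+)'                          # codes 1..32
    r'|(?P<Number>\d+(?:\.\d+)?(?:[Ee][+-]?\d+)?)'      # integer or real
    r'|(?P<Op>[-+*/()])'                                # one-character operators
    r'|(?P<Unknown>.)')                                 # illegal token

def tokenize(source: str):
    """Yield (type, lexeme) pairs; adjacent tokens need no separating space."""
    for m in TOKEN_RE.finditer(source):
        yield m.lastgroup, m.group(0)
```

As described above, ‘2*41’ yields three tokens even without spaces: a Number ‘2’, an Op ‘*’ and a Number ‘41’.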

The tokenizer also has a number of useful fields or properties in Delphi parlance. The most important are

```pascal
property Source: string;          // the formula to parse
property Token: string;           // the current token
property TokenType: TMdTokenType; // the type of the current token
```

Two other properties are present because they help in reporting syntax errors to the user

```pascal
property TokenStart: integer;  // index of the current token
property TokenLength: integer; // length of the current token
```

Except for Source, all these properties are read only and are updated by calls to the functions

```pascal
function NextTokenType: TMdTokenType;
function NextTrueTokenType: TMdTokenType;
```

which return each token’s type in turn. Usually the second of these functions is used because it skips meaningless white space.

Now it’s a simple matter to put all this together in an application. The parser’s sub-functions are hidden in the implementation of a Delphi unit called SimpleMathParser.pas. The initialization code of the unit creates a private TMdTokenizer object, named Tokenizer, that is used repeatedly by the unit’s main method:

```pascal
function EvaluateExpression(const source: string): float;
begin
  Tokenizer.Source := source;
  Tokenizer.NextTrueToken; // prime the pump
  result := Expression;
  if Tokenizer.TokenType <> ttEOS then
    RaiseParseException(peExpectedEOS);
end;
```

All the function does is load the source of the formula into the tokenizer and start the process by having the tokenizer find the first non-white token. It then calls the sub-function Expression to parse and evaluate the string. Once an expression has been parsed and evaluated, a check is made to make sure that there is nothing left in the source.

Some Coding Details

The definition of a number suggests that it can be any real number, including any whole number. In practice that is not possible. Internally, all numbers, whether whole or not, will be of type float, which is declared in the unit ParserTypes.pas. Currently, float is declared to be an extended real number. This number type is 10 bytes wide and has a range from 3.6×10^−4951 to 1.1×10^4932 with about 20 significant digits. In some circumstances it may be better to change float to be a double. Doubles are 8 bytes wide and have a range from 5.0×10^−324 to 1.7×10^308 with about 15 significant digits. The format for double is almost universal (see Goldberg (1991) about the IEEE 754 standard), and in newer versions of Delphi, extended is merely an alias for double when compiling for the Win64 platform.

A decision has to be made about the handling of the decimal symbol. In English the symbol is the period ‘.’; in French it is the comma ‘,’. Accordingly, 5/2 written as a decimal number should be entered as 2.5 on an English language system and as 2,5 on a French language system. This is precisely how Excel and OpenOffice/LibreOffice Calc work. However, some mathematical applications such as Maple and Euler Math Toolbox, some statistical packages such as Minitab and Gretl, and some graphing programs such as Graph and Crispy Plotter insist on using the period as the decimal symbol no matter what the locale.

The approach adopted here is to use the locale-defined symbol by default but to allow change, even at run time. There is a DecimalChar field in the tokenizer which is set to SysUtils.DecimalSeparator when the object is created. While this is a platform-dependent function, both the Delphi and FreePascal implementations of SysUtils hide the details and seem to work across all supported platforms. As mentioned above, it was decided to create a private instance of the tokenizer in the parser’s implementation section. It was therefore necessary to create public access functions to the private tokenizer’s DecimalChar field in SimpleMathParser.pas:

```pascal
{: Character used as the decimal symbol. By default this is defined in
   the computer's locale (usually a period for English language systems
   and a comma for French language systems). }
function ParserDecimalChar: char;

{: Sets the character to be used as the decimal symbol.
   Returns the previous decimal symbol. }
function SetParserDecimalChar(value: char): char;
```
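The effect of a configurable decimal symbol can be shown with a small Python sketch (`parse_number` and `decimal_char` are illustrative names, not identifiers from the unit): the lexeme is normalized to the period before conversion, so the same parser core works under either convention.

```python
def parse_number(lexeme: str, decimal_char: str = '.') -> float:
    """Convert a number lexeme whose decimal symbol may be ',' or '.'."""
    # Normalize the locale's decimal symbol to the period, then convert.
    return float(lexeme.replace(decimal_char, '.'))
```

On a French-style locale, `parse_number('2,5', decimal_char=',')` yields the same value as `parse_number('2.5')` on an English-style one.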

While the subject of the decimal symbol may seem of minor importance, especially to an anglophone reader, it will become obvious later (in parts 3 and 4 of this series) that it must be addressed with care. The tokenizer is defined as an object, but how many tokenizers would ever be needed in an application? The most likely answer is exactly one. In that case, it would have been acceptable to create a singleton tokenizer in the interface of SimpleTokenizer.pas and hide the details in the implementation part of the file. However, in previous applications that, like Excel and LibreOffice Calc, internationalized the names of internal functions, it proved useful to create a descendant of the tokenizer to help in translation.

Testing

Unit tests are provided for both the parser and the tokenizer. An attempt has been made to have the tests function correctly no matter the type of float set in ParserTypes.pas. For example, the tests TestOverflow and TestUnderflow in TestParser.pas use the function sizeof(float) compared to sizeof(single), sizeof(double) etc. to decide which values to use to create a numerical overflow or underflow.

If float is not set to extended, there can be problems because the expected results of calculations done by the parser are compared with the same calculations performed by Delphi’s compiler. The latter calculates results using extended floating-point representation. Consequently, there will be differences in some results because of the more aggressive rounding when using double or single precision numbers. To avoid this problem, many of the tests use integer values.

The unit tests were performed with Delphi 10. They are based on Juanco Anez’ DUnit. While DUnit was not shipped with older versions of Delphi, it is possible to install DUnit starting with Delphi 4.

Demonstration Program

A small demonstration program is included with the parser’s source code. It contains an edit box in which the expression to be evaluated is entered. Evaluation will be done when the button [Calc.] is pressed. The result is shown to the right of the button. The result is either a numeric value if the expression is correct or an error message.

[Screenshot: the demonstration program]

It is also possible to see a journal of the parsing process. Click on the button [Show Tree] and a window appears showing how each of the basic methods of the parser (Expression, Term and Factor) was called to parse the input.

The demonstration program was compiled and run without problem with Delphi 2 and Delphi 10. It will work with version 1 of Delphi (16 bit) although some changes such as shorter file names and conversion of C-style // comments will be required. It should also compile with Lazarus/FreePascal with minimal changes.

Source code

The source code for the tokenizer and parser is contained in the archive ParsingMathExpr-1.zip. It also contains the unit tests.

Bibliography

Aiken, Alex, Compilers, https://www.coursera.org/course/compilers

Anez, Juanco (2006) DUnit: An Xtreme testing framework for Borland Delphi programs, http://dunit.sourceforge.net/

Crenshaw, Jack W. (1988-1995) Let’s Build a Compiler, Available from many sites on the Web, a nicely formatted version is available at http://www.stack.nl/~marcov/compiler.pdf

Goldberg David (1991), "What Every Computer Scientist Should Know about Floating-Point Arithmetic", ACM Computing Surveys, vol. 23, no 1. Can be found at http://perso.ens-lyon.fr/jean-michel.muller/goldberg.pdf

Jensen, Kathleen & Niklaus Wirth (1975) PASCAL – User Manual and Report. New-York: Springer-Verlag.

Kernighan, Brian W. and P. J. Plauger (1981) Software Tools in Pascal. Reading, Massachusetts: Addison-Wesley Publishing Company.
