LingDoc Transform

 

A tool for document and language engineering

 

 

 

Reference Manual

 

Version 3.2

 

 

 

 

 

 

 

Palstar

Uffelte

The Netherlands

www.lingdoc.eu

www.palstar.nl

 

 

 


1........... Introduction. 6

1.1      Hands‑on. 6

1.1.1      Example on Dutch license plates. 6

1.2      From grammar to program.. 7

1.2.1      Preparation phase. 7

1.2.2      Compilation phase. 8

1.2.3      Execution phase. 8

1.3      Grammars 8

1.3.1      Meta language. 9

1.3.2      Types of rules and types of grammars 10

1.4      The compiler and the run time system.. 11

1.4.1      Finite state automata. 11

1.4.2      Technical remarks 12

1.5      Miscellaneous concepts 12

1.5.1      Recursion. 12

1.5.2      Ambiguity. 13

1.5.3      Unit rules and empty rules 13

1.6      Contents of this manual 13

2........... Basic Syntax. 15

2.1      Meta symbols 15

2.2      Terminal symbols 16

2.2.1      Characters and strings 17

2.2.2      Ranges 18

2.2.3      Wildcards 19

2.3      Non-terminal symbols 19

2.4      Regular expressions 20

3........... Extended Features. 22

3.1      Actions 22

3.1.1      Assignment and tests 22

3.1.1.1    Variables, literals and expressions 22

3.1.1.2    Assignment 23

3.1.1.3    Types of tests 24

3.1.1.4    Combinations of tests and/or assignments 25

3.1.2      Messages 26

3.1.2.1    Simple messages 26

3.1.2.2    Messages with expressions 27

3.1.3      Places of actions in rules 28

3.1.3.1    Actions following symbols 28

3.1.3.2    Actions following regular expressions 29

3.1.3.3    Actions as part of the repetition of regular expressions 29

3.1.3.4    Actions in empty units 30

3.2      Symbol type extensions 31

3.2.1      Non-terminals with parameters 31

3.2.2      Lexicon symbols 33

3.2.3      Cover non-terminals 34

3.2.4      Intermediate symbols 35

3.3      Negation. 36

3.3.1      Negation of terminal characters, ranges and intermediates 37

3.3.2      Negation of non-terminal symbols 37

3.3.3      Negation of regular expressions 37

3.3.4      Negation of a string of terminal symbols 38

3.4      Error recovery. 38

4........... Context Sensitive Grammars. 40

4.1      Description of context sensitive rules 40

4.2      Context sensitive grammars for recognition or parsing. 40

4.2.1      Example of a context sensitive grammar used for recognition. 40

4.2.2      (Cover) non-terminals in context sensitive grammars 41

4.2.3      Intermediates in context sensitive grammars 42

4.3      Transduction grammars 43

4.3.1      Description of transduction rules 43

4.3.2      Non-terminals in transduction rules 45

4.3.3      Intermediates in transduction rules 45

4.3.4      Ambiguity and "looping" problems in transduction grammars 46

4.3.5      Compiling and running a transduction grammar 47

4.3.6      Cascaded grammars 47

5........... Run Time Input and Output files. 48

5.1      Input files 48

5.2      Appearance of symbols in output files 49

5.3      Parse tree output 50

5.4      Report output 51

5.4.1      Report of messages 51

5.4.2      Report of output parameters 52

5.5      Error output 52

6........... Switches (Directives) 52

6.1      Compiler Directives 53

6.1.1      GUI: Advanced Mode, Standard fields: Compiler 53

6.1.1.1    Switch: FINITE. 54

6.1.1.2    Switch: ONE_CS_REDUCE. 54

6.1.1.3    Switch: ONE_ALTERNATIVE. 54

6.1.1.4    Switch: RECURSION_MESSAGE. 55

6.1.1.5    Switch: SHARED_UNIVERSE. 55

6.1.1.6    Switch: UNIVERSE. 55

6.1.1.7    Switch: TRANSDUCE. 55

6.1.1.8    Switch: PREPARE_ PARSETREE. 56

6.1.1.9    Switch: INTERM_INPUT. 56

6.1.1.10  Switch: CHECK_VARIABLES. 56

6.1.1.11  Switch: COMP_BASE. 56

6.2      Runtime Directives 56

6.2.1      GUI: Advanced Mode, Standard fields: Parser 56

6.2.1.1    Switch: ECHO.. 57

6.2.1.2    Switch: IGNORE_CR. 57

6.2.1.3    Switch: SPLIT_OUTPUT. 57

6.2.1.4    Switch: BUILD_PARSETREE. 57

6.2.1.5    Switch: SKIP_CHARS. 58

6.2.1.6    Switch: SKIP_ON_ERROR. 58

6.2.1.7    Switch: OUTPUT_PARAMETERS. 58

6.2.1.8    Switch: LEXICON.. 58

6.2.1.9    Switch: LEX_CHARS. 58

6.2.1.10  Switch: SINGLE_LINES. 58

6.2.1.11  Switch: IGNORE_BLANKS. 59

6.2.1.12  Switch: FLAT_OUTPUT. 59

6.2.1.13  Switch: CASE_CONVERT. 59

6.2.1.14  Switch: FIND_ON_ERROR. 59

6.2.1.15  Switch: PARSE_BASE. 59

6.2.1.16  Switch: LEX_INCREMENT. 60

6.3      Switches for error recovery. 60

6.3.1.1    Switch: CPU_LOG.. 60

7........... Operation. 60

7.1      Compiler 60

7.1.1      Commands for compilation in batch. 61

7.1.2      Screen output in the compilation phase. 61

7.1.3      Files related to compilation. 62

7.1.4      Error handling during compilation. 62

7.2      Run time system.. 63

7.2.1      Commands for the run time system in batch. 63

7.2.2      Screen output and keyboard input during the execution phase. 64

7.2.3      Files related to the run time system.. 64

7.2.4      Error handling during run time. 65

8........... Problems and current limitations. 66

8.1      Current limitations 66

8.1.1      Upper limits for the size of a grammar 66

8.1.2      Finite versus non-finite. 66

8.1.3      Limitations concerning context sensitive rules 67

8.1.4      Limitations concerning actions 67

8.1.5      Miscellaneous limitations 67

8.1.6      Known effects of "bugs" 68

8.2      Debugging Grammars 68

9........... Tools. 69

9.1      Lexicon. 70

9.1.1      GUI: Advanced Mode: Lexicon-Compiler 71

9.2      Drawtree. 71

9.3      Drawgraph. 72

9.4      Plink. 72

9.4.1      GUI: Advanced Mode: Cascade Linker 73

9.5      Xgram.. 73

9.6      Disassembler 74

9.6.1      GUI: Advanced Mode: Disassembler 74

10.......... Overview. 75

10.1     Meta language of Transform grammars 75

10.2     Rules for the construction of user defined symbol names 76

10.3     Checklist for switches 77

10.4     Batch-Commands 78

11.......... APPENDIX A: SWITCHES FOR PROGRAMMERS. 79

11.1     Compiler programmer switches 79

11.1.1     GUI: Advanced Mode: Compiler, Advanced Fields 79

11.1.1.1  Switch: TEST_COMP_SETS. 79

11.1.1.2  Switch: TEST_LABELTREES (obsolete) 79

11.1.1.3  Switch: WRITE_PGFILE. 79

11.1.1.4  Switch: OPTIMIZE. 79

11.1.1.5  Switch: TRACE_COMP. 80

11.1.1.6  Switch: REDUCE_SHIFT. 80

11.1.1.7  Switch: PRINT_SETS. 80

11.1.1.8  Switch: ADD_CS_RULES. 80

11.1.1.9  Switch: SKIP_TREES (obsolete) 80

11.1.1.10 Switch: UPDATE_LABFILE. 80

11.1.1.11 Switch: PRINT_STRUCTURE. 81

11.1.1.12 Switch: DEBUG_CONTROL. 81

11.1.1.13 Switch: TEST_TREES (obsolete) 81

11.1.1.14 Switch: SKIP_LABELTREES (obsolete) 81

11.1.1.15 Switch: WRITE_SYMBOLICS. 81

11.1.1.16 Switch: PRINT_RESULT. 81

11.1.1.17 Switch: DEALLOCATION.. 81

11.1.1.18 Switch: WATCH.. 81

11.1.1.19 Switch: COMP_STATISTICS. 82

11.1.1.20 Switch: INTERACTIVE. 82

11.1.1.21 Switch: PRINT_STATES. 82

11.1.1.22 Switch: BOOKKEEPING_ITEMS. 82

11.2     Run time programmer switches 83

11.2.1     GUI: Advanced Mode: Parser, Advanced Fields 83

11.2.1.1  Switch: TREE_INPUT (obsolete) 83

11.2.1.2  Switch: TRACE_USECOUNT. 83

11.2.1.3  Switch: PRINT_PARSETREE. 83

11.2.1.4  Switch: COUNT_CONNECTORS. 83

11.2.1.5  Switch: TRACE_LEVEL. 84

11.2.1.6  Switch: TRACE_ERROR. 84

11.2.1.7  Switch: ONLINE_MESSAGES. 84

11.2.1.8  Switch: PARSE_STATISTICS. 84

11.2.1.9  Switch: COUNT_PARSETREES. 84

11.2.1.10 Switch: BATCH.. 85

11.2.1.11 Switch: TRACE_PARSE. 85

11.2.2     Switches for interactive use. 85

12.......... APPENDIX B: ASCII TABLE. 86

13.......... APPENDIX C: COMPILER ERROR MESSAGES. 87

13.1     Errors in values of switches 87

13.2     Insufficient constants 87

13.3     Not yet implemented features 87

13.4     Syntax errors in the grammar, not detected by Bootstrap. 88

13.5     Errors in the .IM file. 88

13.6     Debug facilities 89

13.7     Implementation errors 89

 

 


1     Introduction

This reference manual presents a general overview of LingDoc Transform, together with a detailed description of the Transform formalism, its operation and its auxiliary tools.

 

LingDoc Transform (short Transform) is introduced in a Management Overview and in Five Position Papers. For a formal description of the underlying ideas and principles we refer to the following book:

 

              G.J. van der Steen. A program generator for recognition, parsing and transduction with syntactic patterns. CWI Tract 55, Center for Mathematics and Computer Science. P.O. Box 4079, 1009 AB Amsterdam, The Netherlands, 1988, 284 pp., ISBN 90 6196 361 3.

 

An ordinary user who wants to make use of the possibilities of the system has to understand some of the underlying concepts and must be familiar with some of the terminology. In this introductory chapter only those terms will be explained which are indispensable for the understanding of the remaining chapters of this manual. Where necessary the text is illustrated with examples.

 

1.1     Hands‑on

 

A hands-on experience is described in the User Manual of Transform, together with an introduction to the formalism. In this Reference Manual we go into more detail and describe the Transform system from a general point of view.

 

1.1.1        Example on Dutch license plates.

 

In addition to the examples within the User Manual we will use, throughout this Reference Manual, an example on Dutch license plates.

 

Dutch license plates consist of a combination of pairs of letters and pairs of digits. The pairs are separated by hyphens. Back in ancient history, the plates started with letters, followed by two pairs of digits. Nowadays, plates start with a pair of letters, followed by two digits and another pair of letters. Here is the history:

 

      AB‑12‑34

      12‑34‑AB

      12‑AB‑34

      AB‑12‑CD

 

Now, enter the following grammar using an editor that produces plain text files. Name the file LICENCE.GRM.. (This file is already contained within the directory Examples_Reference_Manual). The .GRM extension indicates that the file contains an Transform grammar.

 

      /* LICENCE.GRM */

 

      LicensePlates::=  ( LicensePlate , CR )+ .

      LicensePlates::=  Letters , Hyphen , Digits , Hyphen , Digits {S:2} |

                        Digits  , Hyphen , Digits , Hyphen , Letters {S:3} |

                        Digits  , Hyphen , Letters , Hyphen , Digits {S:4} |

                        Letters , Hyphen , Digits , Hyphen , Letters {S:5} .

      Letters     ::=   [a-z] {R:1} , [A-Z] {R:1} .

      Digits      ::=   [0-9] {R:1} , [0-9] {R:1} .

      Hyphen      ::=   '-' {R:1} .

 

 

The texts between the curly brackets are not part of the actual grammar; they are only there to ensure that the grammar produces some interesting output, which does not mean that they may be omitted. The heading, the grammar name between ‘/*’ and ‘*/’, is just a comment.

 

Now, create a file called LICENCE.SW. Its contents should be a single line saying FINITE (lowercase is allowed as well). Alternatively, use the GUI in the “Advanced mode” to switch FINITE on. This will create the file LICENCE.SW automatically.

 

This LICENCE.SW file is consulted by the Transform compiler. It contains so‑called switches: instructions to the compiler to modify its behavior to some extent. In this case the compiler is instructed to generate a so‑called finite state machine. The file modifies the default settings of the Transform compiler. If no .SW file exists, the compiler uses its default settings.

 

After setting the switches, select the grammar in the GUI by pressing the button “Select grammar”. Then, invoke the Transform compiler by pressing the button “Compile grammar”.

 

Before the compiler starts, the syntax of the grammar is checked. This is done by means of the grammar of Transform Grammars, called BOOTSTRAP.GRM.

 

When the compilation is complete, you can look in the directory which contains the grammar. You will find that some other files have been created.

 

At this stage, you can invoke the program that actually runs the grammar. This run time system requires an input file. Therefore, create an input file containing correct license plates. Name the file PLATES.TXT. (This file is already present). Its contents should be:

 

      12‑34‑AB

      12‑AB‑34

      AB‑12‑34

      AB‑12‑CD

 

In the GUI, select the input file by pressing the button “Run grammar”.

 

If all goes well (and let's hope it does), the Transform Process Monitor says "Input correct". This means that the Transform system did not find an incorrect license plate in your input. Try for yourself what happens if you make errors in the syntax of the license plates.

 

Apart from saying "Input correct", the parsing process has a side effect. There now is a file called PLATES.REP. You can view this file. Its contents list the original input (preceded by "1:"), followed by lines like "2:", "3:" and so on.

This .REP file (usually called report file) is produced because the grammar contained so-called messages (the text between the curly brackets), which are instructions to write part of the input to the output. The "{R:1}" strings in your grammar copy the input to the report file, preceded by "1:". The "{S:2}" string only writes "2:" to the report file. Note that the digits 2 through 5 in this grammar indicate from which period in the history the license plate stems.

 

1.2     From grammar to program

 

The Transform system consists of two main programs, a compiler and a run time system. The process in which these programs are used can be described in a cycle of three steps:

 

1.2.1        Preparation phase

 

              The first step is to write a grammar in the formalism prescribed by the system. This can be done by creating an ordinary text file with an editor. At this point the grammar writer can influence the outcome of the next step by describing the goals of his grammar (for example "description versus change" or "recognizing versus parsing") or by describing some peculiarities of the grammar. This can be done by creating a so-called switches file, which is also a text file.

 

1.2.2        Compilation phase

 

              In the next step the grammar must be compiled by the Transform compiler. In fact the compilation process consists of two stages. In the first stage the grammar is checked for grammatical errors; the second stage is the generation phase in which the machine code is generated that constitutes the program resulting from that particular grammar. If an error is detected in either stage, the grammar writer must return to the first phase and correct his grammar; otherwise he goes to the next phase.

 

1.2.3        Execution phase

 

              The compiled grammar is now transformed into a kind of program (also called automaton or machine) and can run over one or more input files, text files for which the grammar has been designed. This can be done with the help of the second main Transform program: the run time system. Again some options can be specified in the so-called switches file. For example, one can specify the name of a lexicon that must be used or one can instruct the run time system to neglect the difference between lowercase and uppercase letters. When errors are detected during this phase one can decide whether to correct the input file (for example when the errors were due to typing errors in the input file) or to start the cycle again by correcting the grammar. The different kinds of output (transduction output, parse trees, etc.) will be written to separate files which the user can inspect or use in other programs.

 

In the remaining sections of this chapter a number of concepts belonging to the automaton theory will be used. This manual is, however, not the place to give an introduction in this theory. For the reader who wants to know more about this subject we refer to the literature.

 

1.3     Grammars

 

Formal grammars consist of a set of interdependent rewriting rules. With these rules one describes syntactic regularities in a specific domain. For instance, a grammar for Simplified English gives a description of all the possible sentences that are grammatical in this particular domain. The combined set of all these sentences, however, is only a subset of all English sentences. And as another example, a grammar for the transformation of graphemes into phonemes in Dutch describes the rules by which pronunciation takes place.

 

A grammar defines a language, consisting of all sentences (ranges of characters or words) that are correct according to this grammar. The program resulting from a compiled grammar will mark every input sentence as correct or incorrect according to the grammar. When the terms "language" and "sentence" are used in this manual, they refer to this particular definition, unless stated otherwise.

 

The automatic processing of texts by means of a grammar can serve different purposes. Three goals can be distinguished:

 

1.           Recognition. In this case a grammar is compiled to a program which will take a string as input and either accept or reject it depending on whether the string is a sentence which may be produced by the syntactic description. Such a program is called a recognizer.

 

2.           Interpretation. The grammar is not only used to analyze an input string in syntactic terms, but also to give a structural description of the sentence in the form of a so-called parse tree. This kind of activity is known as parsing and the program resulting from such a grammar is called a parser.

 

3.           Change. The input material (or parts of it) has to be replaced by other characters and/or words, has to be omitted and/or has to be placed in a restructured order. The grammar contains rules by which this transformation must take place. This process is called transduction and the compiled grammar a transducer.

 

The difference between recognizing and interpreting programs is only a gradual one; programs of the latter type will be used as a combined interpreter/recognizer. They will decide whether or not an input string is correct according to the syntactic description of the grammar and at the same time a parse tree (or alternative parse trees) will be built. In the Transform system the three goals mentioned were unified in one single concept. At the same time a number of grammatical formalisms were unified [1]. Facilities for pattern matching are also included. See for a short overview the Position Paper “LingDoc and Grammar Formalisms”.

 

 

1.3.1        Meta language

 

Grammars must be written in a kind of formal language, the meta language. This language defines its objects and prescribes the appearance of grammars.

 

The syntax for Transform grammars is fully described in Bootstrap, a "grammar for Transform  grammars". This grammar describes all correct grammars that can be written in the Transform formalism. It is used in the first phase of the compilation process when the input, a grammar written by a user, is checked for syntax errors. The outcome can be:

 

             the user grammar does not contain any syntactic errors and will be recognized as a Transform grammar, or

             the user grammar is syntactically incorrect according to the Bootstrap grammar and the errors will be marked.

 

In the Transform system, a grammar consists of an unordered set of rules. Each rule is basically built out of three groups of components:

 

1.           Terminal symbols. These are characters and/or words literally found in the input that must be recognized, parsed or transduced by the compiled grammar.

 

2.           Non-terminal symbols. The user can define auxiliary terms as names or abbreviations for sequences of terminal symbols and/or non-terminal symbols. A non-terminal can be used for two different purposes:

                       For convenience, for example as an abbreviation for a long series of symbols, repeated several times in the grammar.

                       In a structural description as a name for an entity that corresponds with a part of the object. For example one can use the name "NounPhrase" for a specific part of a natural language sentence.

 

3.           Meta symbols or reserved symbols (mostly punctuation marks). They make it possible to describe a complex structure in a compact notation.

 

A grammar rule consists of a left-hand side and a right-hand side, eventually separated by a rewrite sign (in the Transform formalism: a double colon "::"). The end of the rule is marked by a period ("."). The left-hand side of a rule may consist of one or more symbols separated by the concatenation sign (comma ","). The right-hand side of a rule (also called the input side) may consist of a, possibly empty, sequence of symbols also eventually separated by the concatenation sign.

 

The meaning of a rule can be stated as:

              "the left-hand side may be rewritten in the right-hand side" or

              "the right-hand side is the production of the symbol at the left-hand side" or

              "the right-hand side is the definition of the left-hand side".

 

As soon as an input fits the description of the right-hand side of a rule, we can state:

              "the symbol on the left-hand has been recognized".

Another terminology designating this process is:

              "the right-hand side reduces the symbol on the left-hand side".

 

 

1.3.2        Types of rules and types of grammars

 

The number of symbols on either side of the rewrite sign of a grammar rule determines the type of the rule:

 

             If the left-hand side consists of only one symbol, the rule is said to be context free. The symbol on the left-hand side of a context free rule is always a non-terminal. The right-hand side may be empty.

 

             If the left-hand side consists of more than one symbol, the rule is said to be context sensitive. Within the Chomsky hierarchy of rules and grammars, a distinction is made between truly context sensitive rules and type‑0 rules. In truly context sensitive rules, the right-hand side is at least as long as the left-hand side. Type‑0 rules are shortening rules: the right-hand side is shorter than the left-hand side. They are also known as unrestricted rewriting rules. [2]

 

In connection with the different types of rules, different types of grammars are distinguished.

 

Context free grammars are grammars which consist of only context free rules. Each rule of a context free grammar rewrites one symbol in a sequence of zero or more symbols.

 

Context sensitive grammars are grammars which contain one or more context sensitive rule. Type‑0 grammars (or unrestricted rewriting grammars) are grammars which contain at least one shortening context sensitive rule. Type‑0 grammars are undecidable, which means that the parsing process may not terminate.

 

Within the Transform system, grammars that are used for recognition and for parsing can be context free or context sensitive, but type‑0 rules with an empty right-hand side are not allowed. Each grammar has a starting symbol, defined as the first symbol on the left-hand side of the first rule in the grammar.

 

Apart from context free rules, a grammar used for recognition or parsing can also contain context sensitive rules (making the grammar a context sensitive grammar). The class of context free languages is a real subset of the class of context sensitive languages. As a consequence, a grammar using context sensitive rules can recognize input that would be unrecognizable if the context sensitive rules were left out.

 

As an example, one may think of a grammar describing the plural of English nouns as "singular noun followed by an 's'". By means of a context sensitive rule stating that any "ies" at the end of a word may also be treated as if it were transformed into "ys" one can guarantee that forms like "spies" (derived from "spy" + "s") can also be recognized. This example will be explained in detail in section 4.2.1 (Example of a context sensitive grammar used for recognition).

 

Transduction grammars contain at least one context sensitive rule, possibly supplemented by one or more context free rule. The context sensitive rules in this type of grammar are called transduction rules. Transduction grammars do not have a starting symbol. Transduction rules can be shortening or lengthening, however, their right-hand side may never be empty. The goal of this type of grammar is to replace input conforming to the pattern described on the right-hand side of transduction rules with the characters underlying the pattern on the left-hand side.

 

 

1.4     The compiler and the run time system

 

The compiler transforms a grammar into a kind of program (or automaton): it generates code that can be interpreted by the run time system as instructions for the treatment of input for which the grammar was designed. For example, it can contain instructions for the transformation of recognized patterns, for building parse trees, for the consultation of an external lexicon, for emitting messages and for performing tests.

 

In generating the code the compiler describes every possible so-called state of the run time system during the treatment of the input. This state is only dependent on the section of the input already treated.

 

Dealing with natural language means dealing with ambiguities (see also section 1.5.2 Ambiguity). In the design of the Transform system, special care has been paid to handle ambiguity in an efficient way. The developed run time system is an interpreter for so-called Parallel Transduction Automata (PTAs). The run time system tries to follow all possible paths in parallel (breadth first). If there is more then one path, common parts of the resulting parse trees will be shared as much as possible [3]. As a consequence, recognition will take place in polynomial time, when this is theoretically possible. In case of an interpreting program the resulting output contains all remaining parse trees.

 

 

1.4.1        Finite state automata

 

Some types of grammar (namely non-nested context free grammars, see section 1.5.1 Recursion) can be compiled in so-called finite state automata. A finite state automaton can be modeled by a finite number of states, an input (tape) and a set of transitions which specify, given a state and an input, the resulting state. A finite state automaton is thus modeled as a computer without any memory.

 

Other types of grammar cannot be compiled into finite state automata, because the languages they describe are too complex to use an automaton without any memory. The number of states that is needed will become infinite unless memory is added to the automaton. Therefore a stack is used as a storage device; the run time system must, for some time, postpone some decisions and must therefore keep track of past treatments.

 

For example, grammars containing nested recursion (see section 1.5.1 Recursion) can never be compiled into a finite state machine. Some extended features in the Transform formalism also make use of a stack in run time. Grammars making use of these features must always be compiled to a non-finite automaton, although the type of grammar may not demand it. Other features (for example negation and transduction) can only be used in grammars to be compiled into a finite state machine.

 

The choice between generation of a finite state machine and a non-finite state machine is (still) left to the user. One can add the switch FINITE or NOFINITE to the switches file. At appropriate places in the presentation of the Transform formalism one will find remarks on the possible combinations.

 

In spite of these limitations there remains a group of grammars for which one has a choice between the two possibilities. In these cases one can consider that a grammar compiled to a finite state machine will run faster (because the run time system does not need to maintain a stack) but the compilation needs more time and memory and will produce more program code (because more states will be described).

 

 

1.4.2        Technical remarks

 

The main programs of the Transform system as well as the auxiliary tools are written in standard Pascal. The system has been ported to a variety of hardware platforms. Current implementations include VAX/VMS, Sun (UNIX) and LINUX. The most recent version runs under Windows XP.

 

The compiler and the run time system can run on different machines because the generated program of a grammar is written in pseudo-code.

 

 

1.5     Miscellaneous concepts

 

In this section some terms that will be used in the rest of this manual are introduced: recursion, ambiguity and two special types of rules, the unit rules and the empty rules.

 

 

1.5.1        Recursion

 

Grammars may contain recursion caused by the (in)direct recurrence of the left-hand side non-terminal on the right-hand side of the same rule. A distinction is made between nested and non-nested recursion. Non-nested recursion, i.e. left or right recursion, is caused by the (in)direct recurrence of the non-terminal at the extreme sides of the right-hand side of a grammar rule. In nested recursion the non-terminal reoccurs somewhere in the middle of the rule. Examples:

 

                   Premodifier        ::=           Premodifier , Modifier .                                        /* non-nested, left recursion */

                   Postmodifier      ::=           Modifier , Postmodifier .                                    /* non-nested, right recursion */

                   Brackets              ::=           LeftBracket , Brackets , RightBracket .                             /* nested recursion */

 

Grammars with nested recursion theoretically cannot be compiled into a finite state automaton, because a stack is needed in run time[4]. Non-nested recursion is automatically handled by the compiler.

 

When a grammar is compiled into a finite state automaton, a check for nested recursion is always performed by the program. If one wants the same check performed in the case of FINITE=FALSE, one can add the switch RECURSION_MESSAGE to the switches file.

 

 

1.5.2        Ambiguity

 

During recognition, parsing and transduction, ambiguities may arise. The causes may be:

 

             inherent ambiguity: according to the syntactic rules there is more than one structural description of the input

 

             temporal ambiguity: according to the syntactic rules there is more than one way to proceed in order to reach one single final structural description.

 

In designing grammars related to natural languages, ambiguities are no exception. Often sentences can in fact only be described in several ways and as a consequence the user will expect more than one parse tree.

 

Ambiguities may also arise, however, as a consequence of badly designed grammars. This is partly a question of experience. Some advice however can be given; it will appear throughout the text of this manual in the form of style recommendations. The formalism of Transform offers a lot of possibilities to give a grammar text a transparent lay-out; also comments added to the text can be helpful to avoid unwanted ambiguity.

 

 

1.5.3        Unit rules and empty rules

 

By unit rules we will refer to a special type of rule occurring in grammars. If the right-hand side of a rule contains only one symbol (or a series of alternatives, each containing only one symbol), the symbol on the left-hand side will be reduced immediately as soon as one symbol in the input matches the right-hand side. Without special instructions the compiler will "eliminate" the symbols defined in this manner and as a consequence the run time system will run faster. However, because the run time system will have no knowledge about these symbols, they will not appear in the resulting parse trees. One can instruct the compiler to generate instructions for all auxiliary symbols in the grammar by adding PREPARE_PARSETREE to the switches file.

 

In the Transform formalism, it is permitted to add context free rules with an empty right-hand side to the grammar, so-called empty rules. The compiler will also eliminate these empty rules from a grammar. At first sight such rules do not make any sense. However, they can play an important rule in combination with the tests that can be performed (see section 3.1.3.4 Actions in empty units). The non-terminal on the left-hand side of an empty rule will never appear in parse trees.

 

1.6     Contents of this manual

 

The Transform formalism will be presented in chapter 2 Basic Syntax and chapter 3 Extended Features. Because context sensitive grammars (used for recognition or transduction) have special characteristics, this grammar type will be discussed in chapter 4 Context Sensitive Grammars. Some remarks concerning the output of the run time system can be found in chapter 5 Run Time Input and Output files. Chapter 6 Switches (Directives) contains an overview of all possible options that can be added to the switches file. The commands for activating the compiler and the run time system are subject of chapter 7 Operation. The current limitations of the system will be listed in chapter 8 Problems and current limitations, together with some advice regarding the debugging of grammars. A description of the auxiliary tools can be found in chapter 9 Tools. The concluding chapter 10 Overview in which an overview is given of all the different aspects of the system can be used as a quick reference.


2     Basic Syntax

 

In this chapter, the three basic components of Transform grammars: meta symbols, terminal symbols and non-terminal symbols will be explained. The main power of the Transform formalism, however, lies in the extensions of the simple concepts. The advanced features will be the subject of the next chapter.

 

This reference manual, together with the user manual, uses the syntax of the W3C. The dissertation uses a syntax which we call “classic”. Both syntaxes may be used by calling a different metagrammar.

 

2.1     Meta symbols

 

Meta symbols in the Transform formalism are mainly punctuation marks. In section 1.3.1 (Meta language) we already introduced some of the meta symbols. They are listed below, together with some new ones which will be introduced in this section:

 

────────────────────────────────────────────────────

      meta symbol            function

────────────────────────────────────────────────────

               ::=                  standard rewrite sign between the left-hand side and right-hand

                                      side in context free or context sensitive rules

               <::                  rewrite sign in context sensitive rules

               ,                    concatenation

               |                    alternative

               .                    end of rule

               /*                  start of comment

               */                  end of comment

────────────────────────────────────────────────────

 

Context free rules with the standard rewrite sign are rewriting, or defining, rules. Context sensitive rules may be written with the standard rewrite sign ("::=") or with the context sensitive rewrite sign ("<::"). When a rule with a context free appearance (only one symbol on the left-hand side) is meant to be a context sensitive rule, the rule must be marked as such by the context sensitive rewrite sign ("<::").

 

We repeat here that the right-hand side of context sensitive rules may not be empty.

 

Restriction: Recognizing grammars with context sensitive rules must be compiled with the switch FINITE=FALSE.

 

The right-hand side of a context free rule contains the definition or production or realization of the left-hand side. There may be more than one realization. In this case, the grammar will contain several rules with the same left-hand side. In the Transform formalism, it is possible to collapse rules with identical left-hand sides; alternative right-hand sides may be linked by the alternative sign (vertical bar "|").

 

For example, the two rewrite signs ("::=" and "<::"), can be defined by two rules:

 

            RewriteSign      ::=         '::=' .

            RewriteSign      ::=         '<::' .

 

These rules can be collapsed to only one rule:

 

            RewriteSign      ::=         '::=' | '<::' .

 

            Remark: The concatenation sign (",") has precedence over the alternative sign ("|"). So the rule:

 

                        Symbols            ::=         Letter , Digit | Letter,  Letter .

 

            is equivalent with the two following rules:

 

                        Symbols            ::=         Letter,  Digit.

                        Symbols            ::=         Letter,  Letter .

 

            Remark: The use of the concatenation sign (",") is optional. It may be used for readability, e.g. when a grammar rule is written on one line. It may be left out, e.g. when each notion within a rule is written on a separate line.

 

A grammar may be supplied with comments.

A comment starts with a slash and a star ("/*") and ends with a star and a slash ("*/"). One can write any character between these begin and end markings, except the end combination star‑slash itself.

 

The main purpose of using comment in a grammar is documentation. In the development phase, for example while debugging a grammar with erroneous results, one can switch off one or more rules by enclosing them within comment markers.

 

Comments may be included at the beginning and at the end of the grammar, between rules and within rules (before and after each symbol or meta symbol).

 

Spaces, returns and form feeds between symbols and rules are also ignored, just like comments.

 

The only basic meta symbols not present in the list presented at the start of this section are the ones related to regular expressions. Regular expressions provide a compact notation of repetition in rules.

 

Style recommendations: Large grammars can become complex and difficult to expand. Although a user has to develop a clear style for writing grammars by himself, we can give some recommendations:

1.         Use comments to document the grammar.

2.         The structure of a grammar becomes more clear if blank lines between rules are added and tabs are used for indentation.

3.         Write every alternative on a separate line (unless the alternatives consist of a simple enumeration).

4.         Write all context sensitive rules with the "<::" sign.

 

As an example, the basic syntax of grammars is presented here as (an outline of) a context free grammar:

 

            Grammar                      ::=         Rule,  RestGrammar .

            RestGrammar               ::=         /* nothing */ |

                                                            Rule  RestGrammar .

            Rule                             ::=         LefthandSide,  RewriteSign,  RighthandSide,  Period .

            LefthandSide                 ::=         Symbol,  RestLefthandSide .

            RestLefthandSide          ::=         /* nothing */ |

                                                            ( Comma)?,  Symbol,  RestLefthandSide .

            RighthandSide               ::=         /* nothing */ |

                                                            Alternative,  RestRighthandSide .

            Alternative                    ::=         Symbol,  RestAlternative .

            RestAlternative             ::=         /* nothing */ |

                                                            (Comma)?,  Symbol,  RestAlternative .

            RestRighthandSide        ::=         /* nothing */ |

                                                            VerticalBar,  Alternative,  RestRighthandSide .

 

Remark: Note that the comments consisting of "/* nothing */" mark an empty alternative. For example, "RestGrammar" is defined as empty or as a rule again followed by "RestGrammar". In section 2.4 (Regular expressions) the combination of such alternatives to only one alternative by means of a regular expression will be explained.

 

2.2     Terminal symbols

 

The grammar given at the end of section 2.1 (Meta symbols) is not complete, because it does not contain terminal symbols. Terminal symbols are symbols that are not rewritten; they represent the characters and/or strings that must be found in the input. Therefore they never appear on the left-hand side of a context free rule.

 

In this section three different appearances of terminals will be discussed: single characters and strings, ranges and wildcards.

 

 

2.2.1        Characters and strings

 

All characters (including the characters that have a meta function), will be recognized by the system as terminals if they are surrounded by single quotes, for example:

 

            Vowel  ::=         'a' | 'e' | 'i' | 'o' | 'u' .

            Quote   ::=         '''' | '"'.

 

Note that the single quote itself must be doubled.

 

As an obsolete feature, single "name characters" may be represented literally (without single quotes), where "name characters" are defined as the alphanumeric characters and the underscore. So as an alternative of the first rule in the example given above, one can state:

 

            Vowel  ::=         a | e | i | o | u .

 

All other characters may not be represented literally; they must be surrounded by single quotes. Alternatively, as an obsolete feature, they may be preceded by the escape character (backslash "\"). The only exception to these representations is the character that denotes the end of line (also called the carriage return character or the newline character). There is however a reserved symbol which matches an end of line character (the actual value of which is operating system dependent): "cr" or "CR" (not "Cr"). This reserved terminal symbol (like all terminal symbols) may never occur on the left-hand side of a context free rule; this implies that one cannot redefine this reserved symbol.

 

All terminal symbols without exception may also be represented by the number representing their underlying code. A value in the (decimal) range 0’ to ‘255’ represents a character from the extended ASCII set, ISO 8859/1, as used under Windows and UNIX. (In the future, Transform will support wide-range character sets.)

The values have to be written in hexadecimal notation, like ‘#x61’. In Appendix B the characters and their ASCII code in three bases (octal, decimal and hexadecimal) are listed.

 

In summary, there are alternative ways to represent a single character in a grammar:

 

                                   

                                                           

"name characters" like "a"

"single quote" character

"end of line" character

other characters like "."

'a'

''''

CR

'.'

#x61

#x27

#x0D

#x2E

 

Sequences of terminal symbols may be given as a string between single quotes. For example:

 

            abcString          ::=         'abc' .

 

is another notation for:

 

            abcString          ::=         'a',  'b' , 'c' .

 

The matching of the characters in the grammar with characters in the input is case sensitive, so an "a" in the grammar matches only an "a" in the input and not an "A". If there is no need to distinguish between upper and lowercase in the input, one can add the switch CASE_CONVERT to the switches file.

 

There are text files with end of line characters that are not important for the structure of the text. The starting symbol of a grammar designed for a description of the structure of such a text covers input that can contain end of line symbols, but does not necessarily do so. One does not know in advance at which position an end of line symbol or a form feed can be expected. In such a situation the run time system can be instructed to skip the end of line symbols and form feeds by adding the switch IGNORE_CR to the switches file; a corresponding switch for blanks (spaces and tabs) is IGNORE_BLANKS.

 

It is possible to instruct the run time system to ignore other characters as well. For example, if you don't want to mention certain punctuation marks in your grammar, you can add the switch:

 

            SKIP_CHARS = '.' | ',' | '!' | '?' .

 

2.2.2        Ranges

 

With a range one can define a set of alternative terminals. A range consists of two terminal symbols separated by a hyphen. The range includes the two mentioned terminals plus those in between, defined by the order of the underlying character set (the ASCII set in the current implementation).

 

Examples:

 

            NameCharacter                        ::=         [a-z] | [A-Z] | [0-9] | '_' .

 

            OctalDigit                     ::=         [0-7] .

 

            Consonant                     ::=         [b-d] | [f-h] | [j-n] | [p-t] | [v-x] | 'z' .

 

Remark: Note the absence of single quotes around the name characters (letters, digits and the underscore) that mark the ranges.

 

A range may consist of one terminal symbol, for example:

 

            Underscore                   ::=         [_-_].

 

Ranges, just as ordinary terminals, may be defined in terms of their ASCII code, for example:

 

            Ascii                             ::=         [#x00-#xFF] .

            Brackets                       ::=         [#x28-#x29] .    /* "(" and ")" */

 

            Remark: The underlying numerical value of the first terminal must be less than or equal to the value of the second one. If this is not the case an empty range is defined (which is most probably not intended).

 

            Restriction: Ranges can not be used on the left-hand side of context sensitive rules.

 

Note: In other syntaxes, like the W3C ENBF syntax for XML, it is possible to define regular expressions of characters. We support the following forms:

- [a] (one constant) and [a-z] (range of constants);

- [#xN] (one constant) and [#xN-#xN] (range of constants);

- also, the negated variants are supported: [^a], [^a-z], [^#xN] and [^#xN-#xN].

Other variants can easily be dealt with, e.g. [0-9a-z] can be written as [0-9] | [a-z], because of our notation for regular expressions. In our opinion, this notation is more readable.

If required, a future version of Transform will support the syntax of [0-9a-z].

Furthermore, we do not deal with the special treatment of '-' and ']' in the W3C ENBF syntax when used as the first character following the '['.

 

2.2.3        Wildcards

 

Wildcards are meta symbols which match arbitrary characters (including carriage return, form feed etc.). There are three different wildcards, with different functions: a dontcare, a line and an arb symbol. They are listed below, together with their functions.

 

symbol

name

function

*

dontcare

matches exactly one arbitrary character

=

arb

matches zero or more arbitrary characters, not including the character following the arb symbol in the grammar

line

matches any sequence of zero or more arbitrary characters, including occurences of the character following the line symbol in the grammar

 

For example:

 

            Password          ::=         *,  *,  *, *,  *,  * .

 

defines a password as a sequence of six arbitrary characters. And in:

 

            Comment          ::=         '/*',  =,  '*/' .

 

a sequence of characters which starts with ‘/*’ and ends with ‘*/’ is recognized as a comment. The characters which are matched by the arb symbol will not contain a ‘*’. An alternative definition of comment:

 

            Comment          ::=         '/*',  ‑,  '*/' .

 

defines the same sequence of characters, but the characters which are matched by the line symbol may contain one or more ‘*/’. For example:

 

            /* a comment is marked by a "*/" at the end */

 

will be matched twice by the rule with a line symbol. It will only be matched once by the rule with an arb  symbol.

 

            Remark: The arb symbol matches any sequence of characters, not including the character following it in the grammar rule. If the arb symbol is followed by a string of terminal characters, it is only the first character of this string which defines the operation of the arb symbol. For example:

 

            Restriction 1: When the compiler option FINITE=FALSE is used an arb symbol may only be followed by a terminal symbol or by an intermediate symbol (see section 3.2.4 Intermediate symbols). It may not be followed by a non-terminal symbol or a lexicon symbol (see section 3.2.2 Lexicon symbols) and it may not occur at the end of a rule. If FINITE=TRUE, there are no restrictions at all and an arb symbol may even be the last symbol in a grammar rule.

 

            Restriction 2: Wildcards can not be used on the left-hand side of context sensitive rules.

 

2.3     Non-terminal symbols

 

Non-terminal symbols are symbols that occur at least once on the left-hand side of a context free rule. Non-terminal symbols function as names, or abbreviations, for sequences of symbols. Their names consist of a sequence of one or more alphabetic characters (uppercase or lowercase), the digits 0 to 9 and the underscore character ("_"). The maximum length of a non-terminal name is 32 characters.

 

            Remark 1: The names of non-terminal symbols are case sensitive. For example, a symbol defined as "comma" does not match any other spelling of the name, such as "Comma" or "COMMA".

 

            Remark 2: A symbol name consisting of only one character is handled as a terminal unless this name occurs at least once on the left-hand side of a context free rule (then it is considered a non-terminal). Therefore, beware of non-terminals named with only one character.

 

            Style recommendation: To avoid possible confusion with terminals, it is recommended that the names of non-terminals contain at least two characters.

 

Non-terminals defined by one or more unit rules will not automatically appear in parse trees when the run time switch BUILD_PARSETREE has been set to true. Only after setting the compiler switch PREPARE_PARSETREE to the value TRUE will the resulting parse trees contain every appropriate non-terminal. Examples of unit defined non-terminals can be found in the next rules:

 

            AlfaNum          ::=         Alfa | Number .

            Alfa                  ::=         [a-z] | [A-Z] .

            Number                        ::=         ( [0-9] )+ .

 

If the switch PREPARE_ PARSETREE is set to FALSE, "Alfa" and even "AlfaNum" will not appear in any parse tree at all; "Number" will only be present if it matches input consisting of more then one digit.

 

A built-in check on typing errors in non-terminal names is present to some extent. The compiler considers a name consisting of more than one character as an "intermediate symbol" (this symbol type will be introduced in section 3.2.4  Intermediate symbols) unless it appears at least once on the left-hand side of a context free rule. If intermediate symbols are present, their names appear on the screen and in the error file; misspelled names of non-terminals can appear amongst them.

 

Grammars for recognition and/or parsing contain at least one non-terminal: the symbol on the left-hand side of the first rule in the grammar. This special non-terminal is also called the starting symbol.

 

Transduction grammars can be written without any non-terminals, because they do not contain a starting symbol. For example, the rule:

 

            'terminal symbol' <:: 'terminal' .

 

can be compiled as a grammar but contains only terminals. When applied, it converts all the words "terminal" in a document (for instance this manual) to the words "terminal symbol".

 

 

2.4     Regular expressions

 

Regular expressions provide a compact notation of repetition and embedded alternation:

 

─────────────────────────────────────────────────────

expression      function

─────────────────────────────────────────────────────

x                    matches exactly one occurrence of "x"

( x )?             matches zero or one occurrence of "x"

( x )*             matches zero or more occurrences of "x"

( x )+             matches one or more occurrences of "x"

( x | y )?         matches zero or one occurrence of either "x" or "y"

( x | y )           matches exactly one occurrence of either "x" or "y"

( x | y )*         matches zero or more occurrences of "x" and/or "y"

( x | y )+         matches one or more occurrences of "x" and/or "y"

─────────────────────────────────────────────────────

 

In these examples of regular expressions, "x" and "y" may be replaced by a terminal or non-terminal symbol, by a sequence of terminal and/or non-terminal symbols or by another regular expression.

 

The grouping notation (...)  is especially useful if one wishes to overrule the precedence of the concatenation sign over the alternative sign. For example, if the definition of the non-terminal "Symbols" in the example on page 15, was meant to be:

 

            "a string of three characters, the first and the last a letter and between them a letter or a digit",

 

this can be written as:

 

            Symbols            ::=         Letter,  ( Digit | Letter ) , Letter .

 

As an example, the grammar for Transform grammars (earlier presented in section 2.1 Meta symbols), is given here in a more compact notation:

 

            Grammar          ::=         ( Rule )+ .

            Rule                 ::=         LefthandSide,  RewriteSign,  RighthandSide, Period .

            LefthandSide     ::=         Symbol,  ( Comma,  Symbol )* .

            RighthandSide   ::=         Alternative,  ( VerticalBar,  Alternative )* .

            Alternative        ::=         Symbol,  ( Comma,  Symbol )* .

 

            Restriction: Regular expressions can not be used on the left-hand side of context sensitive rules.


3     Extended Features

3.1     Actions

 

Actions relate to extensions of the formalism. There are two kinds of actions:

 

(1)        assignments and tests (section 3.1.1 Assignments and tests)

(2)        messages (section 3.1.2 Messages )

 

The general rules for actions are:

 

      Actions are written between braces "{ ...}".

      Apart from assignments and tests, an action starts with a keyword determining the type of the action. These keywords are "r:" and "s:" and may be written in upper or lowercase without difference in meaning.

      Actions may be conjoined by the conjunction operator (semicolon ";").

      Actions may only be used on the right-hand side of a context free rule. They may occur:

    after each symbol or regular expression

    between the ")" and the " * " or "+" of a regular expression

    as the first (or only) element of an alternative.

 

For some types of actions the effects depend on their position in the rule. In section 3.1.3 (Places of actions in rules) the different possibilities are discussed together with a description of the effects.

 

 

3.1.1        Assignment and tests

 

A powerful mechanism is the application of tests: the recognition of input can be made dependent of a testable criterion. This feature can be used successfully in grammars that describe ambiguous phenomena and can limit the number of alternative parse trees. They are very useful in combination with the symbol type "non-terminals with parameters", to be introduced in section 3.2.1 (Non-terminals with parameters).

 

Tests are performed on variables and literals, which can be combined together into expressions (section 3.1.1.1 Variables, literals and expressions). Before a test can be performed, variables must be given a value by means of an assignment (section 3.1.1.2 Assignment). In section 3.1.1.3 (Types of tests) different types of tests are discussed and section 3.1.1.4 (Combinations of tests and/or assignments) concerns possible combinations of tests and/or assignments.

 

 

3.1.1.1    Variables, literals and expressions

 

In Transform grammars, two kinds of variables can be used: local variables and parameters. The main difference between the two types lies in their scope; the scope of a local variable is the rule in which it occurs, a parameter can be used to transmit information between different rules. In this section, local variables are introduced and although parameters will be explained later on (see section 3.2.1 Non-terminals with parameters), all operations that can be performed on local variables are applicable to variables introduced as parameters as well.

 

Every rule, except context sensitive rules, may be extended with local variables. Local variables are devices which can be used for administrative purposes; information (in the form of a string) available at some point in the rule can be stored in a variable so that it can be used at another point in the same rule. A variable has a name (consisting of name characters, see the first example in section 2.2.2 Ranges) and a value (in the Transform system always a string).

 

A local variable need not be declared; the variable will be allocated at the moment of its first reference in the rule. At the moment of its allocation, it contains the empty string and this remains its content up to the moment that another value is assigned to it.

 

There is a reserved variable name (percent "%"). Its value is always the character that has most recently been read.

 

Besides variables, literal string values (also called literals) can be used in actions. They also store string information, but their value can never be changed. Between the braces ("{-}") of actions, literals are strings of characters between single quotes. The empty string can be represented by merely two quotes. To represent a single quote as part of a literal, it should be written twice.

 

Expressions can be built from variables and literals. This is done with the concatenation operator "||". The value of an expression is the concatenated value of all its components.

 

Remark 1: In the next sections the term "expression" refers also to its most simple form: a simple variable or a simple literal.

 

            Remark 2: In a grammar with many variables, a typing error in the name of a variable is easily made. Because the compiler treats every new name as an occurrence of a new variable, an error in a name can cause incorrect behavior in the compiled grammar. If the switch CHECK_VARIABLES is added to the switches file, all separate variables found in a rule will be listed in the compiler error file.

 

As an example, we give the value of some expressions that can be composed from the following components:

 

            var1 with the value "aa",

            var2 with the value "bb",

            % with the value 'c' and

            the literal 'dd':

 

expression

resulting value

var1 || var2

"aabb"

var1 || %

"aac"

'dd' || var2

"ddbb"

var1 || % || var1

"aacaa"

 

In Transform notation the definitions of the concepts, introduced in this section, are:

 

            /* Definitions for variables, literals and expressions */

 

            Variable                        ::=         VariableName | '%' .

            Literal                           ::=         ''''  ( Token )*  '''' .

            VariableExpression        ::=         VariableOrLiteral,  ( '||',  VariableOrLiteral )* .

 

            VariableOrLiteral          ::=         Variable | Literal .

            VariableName               ::=         ( NameCharacter )+ .

            Token                           ::=         [#x00-#x26] | [#x28-#xFF] | [#x27-#x27].

            /* #x27 is the value of a single quote, which must be doubled */

 

3.1.1.2    Assignment

 

A variable may be assigned a value by means of the assignment operator ("="); in general:

 

            { <variable> = <expression> }

 

which means that after evaluation the variable has the value specified by the  expression.

 

In the Transform notation, the definition of an assignment is:

 

            Assignment       ::=         Variable,  '=',  VariableExpression .

 

            Remark: Explicit assignment to the reserved "%" is not possible; this variable is assigned its value, the last read character, in run time.

 

Following the meaning of the concept expression (see section 3.1.1.1 Variables, literals and expressions), the value of a variable can be set in different ways, for example:

 

            {var1=var2} or {var1=%}         (assignment of a variable)

            {var1='abc'}                             (assignment of a literal)

            {var1=var2||'abc'}                     (assignment of a variable expression)

 

The value of a variable may be empty at any time. Remember that a variable has an empty value at the moment of its first appearance in a rule; its value can become the empty string again by assigning the literal representing the empty string to it. Therefore:

 

            {var=''}

 

removes the value of the variable "var".

 

 

3.1.1.3    Types of tests

 

Tests concern the truth value of a comparison between two expressions. If this value is false, the current recognition path stops. There are two test operators, one for equality and one for inequality.

 

The equality test:

 

            { <expression1> == <expression2> }

 

returns the value true if "expression1" has the same value as "expression2", while the inequality test:

 

            { <expression1> != <expression2> }

 

returns the value true if "expression1" refers to a value other than that specified by "expression2".

 

The definition of tests in Transform notation becomes:

 

            Test                  ::=         EqualityTest   |

                                                InequalityTest .

            EqualityTest      ::=         VariableExpression,  '=',   VariableExpression .

            InequalityTest   ::=         VariableExpression,  '!=',  VariableExpression .

 

Disjunct tests may be stated by means of the "or"-operator:

 

            { <test_1> | <test_2> | ... | <test_n> }

 

is true when the evaluation of at least one of the tests leads to the value true. For example:

 

            { var1== '' | var2==% }

 

is true if the value of the variable "var1" is empty or if "var2" has the value of the last read character in the input.

 

If the combined test concerns the same variable on the left-hand side, a shorthand notation is possible. For example:

 

            {var1==expr1 | var1==expr2}

 

can also be written as:

 

            {var1==expr1 | ==expr2} .

 

            Remark 1: A series of disjunct tests must be considered as a combined test with one value; the value is true if and only if one of its components is true.

 

            Remark 2: The "or"-operator can only be used in combination with tests; a compiler error message will be generated if this operator is used between other actions like assignments and/or messages.

 

 

3.1.1.4    Combinations of tests and/or assignments

 

Multiple assignments may be conjoined by the conjunction operator (";" or "&"):

 

            {var1=expr1;  var2=expr2}        or         {var1=expr1 & var2=expr2}

 

Multiple tests may also be conjoined by the conjunction operator:

 

            {var1==expr1 & var2!=expr2}   or         {var1==expr1;  var2!=expr2}

 

is true when all of the tests are true.

 

            Remark: Because a series of disjunct tests has to be considered as one test with one resulting value, in a combination of the "or"-operator with the conjunction operator the former has precedence over the latter. Therefore, the compound test:

 

                        {var1=='1' | var2=='2' & var3=='3'}

 

            only succeeds when var3 has the value '3' and either var1 has the value '1' or var2 has the value '2' (or both).

 

A mixture of tests and assignments, just as other actions, may also be conjoined by means of the conjunction operator:

 

            {var1==expr1;  var2=expr2}      or        {var1==expr1 & var2=expr2}

 

            Style recommendation: It is recommended that the "&" variant is used for the conjunction of two tests and the ";" variant for the conjunction of two actions where at least one of the actions is not a test. For example:

 

                        {var1==expr1 & var2==expr2;  var3=var1||var2}

 

In this manner the Boolean outcome of the conjunction of two tests is emphasized by means of the sign "&" since the ampersand is often used in mathematics to stand for the Boolean "and"-operator.

 

            Remark: The combination of a test and an assignment has the appearance of an "if‑then" statement in a programming language, but note that if the test fails, the parse fails as well.

 

            You can solve this situation by adding an identical alternative to the rule in which you change the test to the opposite meaning (i.e. "==" becomes "!=" and "!=" becomes "=="). For example, consider the following sketch of a rule:

 

                        aa         ::= ...  ( bb | cc {p='1'} ) {p=='1'; q='2'}  ... .

 

            If you intend this rule to mean "if variable 'p' has obtained value '1', the variable 'q' becomes '2', else 'q' remains unchanged", you should change it as follows:

 

                        aa         ::= ...  ( bb | cc {p='1'} ) {p=='1'; q='2'}  ... |

                                          ...  ( bb | cc {p='1'} ) {p!='1'}           ... .

 

            This problem will become more obvious when non-terminals with parameters are introduced.

 

As a final example, we present a grammar for palindromes (lines that are the same when read from right to left). Such a grammar could be written using local variables:

 

            Palindrome        ::=         ( Char {head = head || %} )+  ( Char )? ,

                                                            ( Char {tail = % || tail} )+ {head == tail} .

 

 

3.1.2        Messages

 

Messages can be used to mark the relevant points in the recognition process by which other (external) processes can be triggered. Messages provide for event-based processing in general purpose programming languages.

When messages are used in a grammar, an output file with the extension .REP (also called "report file") is created in which the messages are written. There are two variants: simple messages (section 3.1.2.1 Simple messages) and messages with variables (section 3.1.2.2 Messages with variables).

 

 

3.1.2.1    Simple messages

 

There are two message functions: signal and report. A message is characterized by a function indication ("s" for signals and "r" for reports) followed by a colon and a positive, non-zero number:

 

            { s: <number> }            or         { r: <number> }.

 

Also, in Transform notation:

 

            Message           ::=         Signal | Report .

            Signal               ::=         'S_or_s', ':' , Number .

            Report              ::=         'R_or_r',  ':',  Number .

            Number            ::=         [1-9]  ( [0-9] )* .

            R_ or_ r                       ::=         'R' | 'r' .

            S_ or_ s                        ::=         'S' | 's' .

 

            Remark: The number "0" cannot be used as a signal or report.

 

Signals or s‑messages only report the number and a colon (":"); reports or r‑messages report the number followed by a colon plus the most recently matched terminal symbol. In the case of subsequent r‑messages with the same report number, the report number is given once and the reported symbols are concatenated (even if they were not contiguous in the input).

 

R‑messages always output the character that has most recently been read. Thus, if a report follows a symbol or regular expression that covers no input (as may be the case in the "( )" and "( )* " expressions), the character that was read before the regular expression will be output.

 

In the grammar:

 

            String                ::=         ( Word {S:2},  Blank )+ .

            Word                ::=         ( Char )+ .

            Char                 ::=         [a-z] {R:1} .

            Blank                ::=         #x20 .

 

any sequence of lowercase alphabetic characters will be recognized as a string and will return as output a report file with two lines. The first line will consist of the report number "1" followed by a colon and all recognized characters, since subsequently reported terminal symbols with the same number are concatenated. The second line will consist of the message number "2" followed by a colon (and nothing else). For example, for the string "abc def " the output will be:

 

            1:abc

            2:

            1:def

            2:

 

If, in the same grammar, the signal "S:2" in the first rule is omitted, the output of the same input will be:

 

            1:abcdef

 

The characters written to the report file by means of r‑messages will not be converted to a readable format. This only has consequences if the input file contains control characters (characters representing the decimal ASCII codes smaller than 32 or greater than 126) and these characters are emitted by r‑messages. A character in the input matching a terminal symbol specified in the grammar as:

 

 ... #x04 {R:4} ....

 

will result in a non-printable character in the report file after the number "4:". To avoid non-printable report files, a signal can be emitted instead of a report for those control characters (for example with a number equal to its ASCII code):

 

 ... #x04 {S:4} ....

 

So the general rule is: characters emitted by r‑messages are equal to the corresponding characters in the input. The only exception occurs when the switch CASE_ CONVERT has either the value LOWER_ TO_ UPPER or the value UPPER_ TO_ LOWER. In those cases letters will not be emitted in their original format but in their converted one.

 

            Remark 1: Although the carriage return character can be represented in the grammar as "cr", the emission of this character to the report file causes a carriage return.

 

            Remark 2: If a parse is ambiguous, messages can be produced in a unpredictable order (see also section 5.4.1 Report of messages).

 

More examples of messages in a grammar rule will be given in section 3.1.3 (Places of actions in rules).

 

 

3.1.2.2    Messages with expressions

 

A variant of the simple r‑messages or s‑messages can be used to communicate expressions. In this use, the report or signal number is followed again by a colon followed by a variable expression:

 

            {r:<number>:<expression>}       or         {s:<number>:<expression>}.

 

For more details about the use of variables and expressions, see section  3.1.1.1 (Variables, literals and expressions).

 

The effect of this action is that as soon as the input is recognized up to the point of the action, the current value of the expression will be written to the report file (preceded by the number and a colon). The same difference in effect that applies for simple reports and signals is found for signals and reports with a variable expression; the resulting expressions of an r‑message are concatenated as long as the same report number is emitted (the report number itself appears only once in the output). For this reason the signals with expressions are of more practical use; each separate emission appears on a separate line in the report file.

 

Examples:

 

            {r:1:var1}

            {r:2:'abc'}

            {r:3:var1||'abc'}

 

The report files for the examples above will contain respectively:

 

            1:...                   (the value of the variable with the name "var1")

            2:abc                (the literal string "abc")

            3:...abc             (where “…" stands for the value of the variable with the name "var1")

 

            Remark: Due to an error in the run time system, messages with expressions sometimes yield incorrect output when used inside a repeated regular expression.

 

 

3.1.3        Places of actions in rules

 

Actions can be written after symbols (terminals, non-terminals and the symbol types that will be introduced later on), after regular expressions and between the closing bracket of a regular expression and the repetition indicators star ("*") or plus ("+"). There is also a possibility for an alternative consisting of only an action; this alternative must be the first one. These possibilities will be discussed in the sections below.

 

 

3.1.3.1    Actions following symbols

 

Actions written after a symbol (terminal, non-terminal or one of the symbols to be introduced in section 3.2 Symbol type extensions), will be performed as soon as the symbol has matched the input. For example, in the rules:

 

            Input     ::=         ( Word {var1=%};  Blank {R:1:var1} )+ .

            Word    ::=         ( [a-z] )+ .

            Blank    ::=         #x20 | #x0B .  /* space or tab */

 

upon recognition of the input "horse", the variable "var1" will be assigned its final value ("e"); this value will be emitted to the report file after recognition of the space.

 

Actions will also be executed when the symbol itself has covered no input. For example in the rules:

 

            Number            ::=         Digit,  Mantisse {R:1} .

            Digit                 ::=         [0-9] .

            Mantisse           ::=         ( '.'  Digit )? .

 

the output in the report file of the input "2" will be "1:2", because the last read input character was a "2".

 

However, if the action is placed after a symbol inside a regular expression that can cover no input the action is not performed. Therefore in:

 

            Input                 ::=         ( ( Determiner {S:1},  Blank )?,  Noun,  Blank )+.

            Determiner       ::=         'a' | 'the' .

            Noun                ::=         'text' | 'book' .

 

upon the input "a text book " the report file will consist of only one "1:".

 

As a special case, we consider actions after wildcards. Actions placed after a wildcard that can cover a whole sequence of characters will be repeated as many times as the wildcard matches the input. For example in:

 

            Comment          ::=         '/*', = {R:1}, '*/'.

 

the report file will contain the number "1", followed by a colon and all terminal symbols matched by the "arb" are reported. The same effect is achieved when a message follows a "line" ("‑"). The dontcare symbol matches only one character, so an r‑message placed behind the dontcare symbol will also report only one character.

 

            Remark: If a wildcard ("arb" or "line") matches no terminal symbols, the "last read character" will contain the character that was matched before the wildcard. An illustration in the case of r‑messages can be found in the preceding example (a ‘*’ will be emitted). The same can be said about the reserved variable "%", occurring in actions after a wildcard.

 

 

3.1.3.2    Actions following regular expressions

 

Actions following a regular expression are executed only when the regular expression has ceased matching input. Thus, in the case of a r‑message, the most recently matched character is reported. It is not the string which has been matched by the regular expression.

 

For example, in the grammar:

 

            String                ::=         ( Word,  Comma )+ {R:1} .

            Word                ::=         ( [a-z] )+ .

            Comma             ::=         ',' .

 

the input "abc,pqr," will result in the following report file:

 

            1:,

 

A special case is formed by messages, assignments and tests containing the reserved variable "%" following a regular expression that does not cover any input. The same which has been said before about actions after symbols that cover no input, also holds for the regular expressions "(…)?" and "(…)* ". In the next grammar with messages, three different input strings are considered to illustrate the effects:

 

            Input     ::=         Bs,  ( 'a' )* {R:1} .

            Bs        ::=         ( 'b' )* {S:2} .

 

            input                 output (in the report file)

            ─────────────────────────────

            aa                                 2:

                                                1:a

 

            bb                                 2:

                                                1:b

 

            bbaa                             2:

                                                1:a

 

 

3.1.3.3    Actions as part of the repetition of regular expressions

 

An action between the ")" and the " * " or "+" differs from the situation in which the action is placed after the " * " or the "+" that concludes the regular expression.

 

For example, in the grammar:

 

            String    ::=         ( Char ){R:1}+ .

            Char     ::=         [#x00-#xFF] .

 

the report is given when the expression is repeated or left. Thus, the report file will consist of the number "1" followed by a colon and the recognized characters, which are concatenated.

As an illustration of the difference between a message action placed before and a message placed after the " * " of a regular expression, we consider the effects of two slightly different grammars with a r‑message:

 

                                    Example 1                                             Example 2

 

                                    Ss         ::= Digit, ( Char ){R:1} * .          Ss         ::= Digit, ( Char )* {R:1} .

                                    char      ::= 'a' | 'b' .                                Char     ::= 'a' | 'b' .

                                    Digit     ::= [0-9] .                                  Digit     ::= [0-9] .

 

            input                 output example 1                                   output example 2

            ─────────────────────────────────────────────────

            1                      (no output created)                                1:1

            1a                     1:a                                                        1:a

            1ab                   1:ab                                                      1:b

 

 

3.1.3.4    Actions in empty units

 

As explained above, variables placed in regular expressions that can be skipped remain unchanged. With the help of an action in an otherwise empty unit, one can force a value to be given to a variable, for example:

 

            QuotedString     ::=         {q='"'}   ( [a-z] {q=q||%} )+ {R:1:q||'"'} .

 

            Remark 1: Only one empty unit may be added to a rule or a regular expression and that must be the first one. The compiler halts with an error when a rule contains more than one empty unit or when the empty unit is written in an illegal position.

 

            Remark 2: In order to simulate an "if‑then‑else" construction for tests (see also section 3.1.1.4 Combinations of tests and/or assignments), one can make use of so-called empty rules. For example, as an alternative for the example in that section, one could write:

 

                        Aa        ::= ...  ( Bb        |

                                                Cc {p = '1'}

                                                ) ,

                                                            ( Empty {p ='1'; q = '2'} |

                                                              Empty {p!='1'}

                                                            ) .

                        Empty   ::=         .

 

PMSUB2.WP bevat: - chapter 2: BASIC SYNTAX


3.2     Symbol type extensions

 

In the basic syntax, only two symbol types were distinguished: terminals and non-terminals. In this section, other types of symbols are introduced; some are entirely different from the two basic ones, other are derivates. In section 3.2.1 non-terminals with parameters  will be introduced. The use of parameters or features transforms a grammar into an attribute grammar. Related to the use of parameters is the introduction of a new symbol type: the lexicon symbols. A description can be found in section