LingDoc Transform

A tool for document and language engineering

User Manual

Version 3.2

Palstar
Uffelte
The Netherlands

www.lingdoc.eu
www.palstar.nl


Contents

 

1  In a nutshell
1.1  What is LingDoc Transform
1.2  About this manual
2  Working Procedure
2.1  SPECIFY
2.2  PARSE
2.3  ADJUST
2.4  CONVERT
3  Install Transform
4  Directory structure
5  Explore the Interface
5.1  Gui: Basic Mode
5.3  Gui: Select Grammar…
5.4  Gui: Compile Grammar…
5.5  Gui: Select Inputfile…
5.6  Gui: Run Grammar…
5.7  Gui: Advanced mode…
5.8  Use your favorite editor
6  Operation
7  Syntax: EBNF grammars and Regular Expressions
7.1  The formalism of Transform in a technical nutshell
7.1.1  Recognition grammars
7.1.2  Transduction grammars
7.1.3  When to use recognition and when transduction grammars
7.2  EBNF
7.2.1  Symbols
7.2.2  Characters
7.2.3  Strings
7.2.4  Ranges of characters
7.2.5  Wildcards
7.2.6  Non-terminal symbols
7.2.7  Regular expressions
7.3  More reading
8  A running example
9  Example 1: getting started
9.1  Source document (Ex_1.txt)
9.2  DTD (Ex_1.dtd)
9.3  Grammar (Ex_1.grm)
9.4  Compile and parse Ex_1
9.5  Experiment
10  Syntax: extensions for output, variables and expressions
10.1  Introduction
11  Syntax: Output
11.1  Introduction
11.1.1  Simple messages
11.1.2  Messages with variables
11.2  More reading
12  Example 2: Output
12.1  Source document (Ex_2.txt, same as Ex_1.txt)
12.2  DTD (Ex_2.dtd, same as Ex_1.dtd)
12.3  Grammar (Ex_2.grm, as an extension to Ex_1.grm)
12.4  Target document (Ex_2.xml)
13  Syntax: Variables and expressions
13.1  Introduction
13.2  Variables, literals and expressions
13.3  Assignment
13.4  Tests on expressions
13.4.1  Introduction
13.4.2  Types of tests
13.4.3  Combinations of tests and/or assignments
13.5  Parameters
13.6  More reading
14  Example 3: Variables
14.1  Source document (Ex_3.txt)
14.2  DTD (Ex_3.dtd)
14.3  Grammar (Ex_3.grm)
14.4  Target document (Ex_3.xml)
15  Methodology for up-conversion: from DTD to grammar
15.1  Introduction
15.2  SPECIFY
15.3  PARSE
15.4  ADJUST
15.5  CONVERT
16  Example 4: The complete conversion
16.1  Introduction
16.2  Source document
16.3  DTD
16.4  Target document in XML
16.5  Grammar
17  Syntax: Extensions of context-sensitive grammars
17.1  Introduction
17.2  Example
18  Syntax: Extensions of transduction grammars
18.1  Introduction
18.1.1  Cover symbols
18.1.2  Intermediate symbols
18.1.3  Ambiguity and "looping" problems in transduction grammars
18.1.4  Negation
18.2  More reading on transduction grammars
19  Example 5: Floating elements
19.1  Introduction
19.2  Source document (Ex_5.txt, same as Ex_4.txt)
19.3  Grammar (Ex_5_td.grm)
19.4  Target document (Ex_5.tdo)
20  Example 6: Two-stage conversion
21  Syntax: Cascaded grammars
21.1  Introduction
21.2  Example
22  Example 7: Cascaded grammars
22.1  Introduction
22.2  Gui: Cascade linker
22.3  Instructions
23  Syntax: lexicons
24  Example 8a: the construction of a lexicon
24.1  Maintenance of a lexicon
24.2  Lexicon for the running example
24.3  Gui: compilation of a lexicon
24.4  Instructions for compiling a lexicon
25  Example 8b: conversion with a lexicon
25.1  Source document
25.2  Grammar
25.3  Instructions for using the lexicon
26  Example 9: Natural language parsing
26.1  Introduction
26.2  The input
26.3  Switches used
26.4  Source document (ex_9.txt)
26.5  Grammar (ex_9.grm)
26.6  Lexicon (ex_9_lex.txt)
26.7  Instructions for running the example
26.8  Parsetrees
27  Example 10: Structure sensitive Natural Language parsing
27.1  Introduction
27.2  Switches used
27.3  Source document (ex_10.txt)
27.4  Grammar (ex_10.grm)
27.5  Lexicon (same as ex_9_lex.txt)
27.6  Parsetrees (ex_10.out)
28  Syntax: Ambiguity and non-determinism
28.1  Introduction
29  Example 11: Ambiguity
29.1  Introduction
29.2  Source document (ex_11.txt)
29.3  Grammar (ex_11.grm)
29.4  Lexicon (ex_11_lex.txt)
29.5  Parsetrees (ex11.out)
30  Miscellaneous
30.1  Run Time Input and Output files
30.2  Error recovery
30.3  Switches
30.4  Drawing (ambiguous) parsetrees
30.5  Drawing (ambiguous) transduction output
30.6  Debugging grammars
31  Contact

 

1     In a nutshell

1.1     What is LingDoc Transform

LingDoc Transform (in short: Transform) is a general-purpose system for the analysis of natural language texts and for the conversion of documents. It is introduced in the “Management Summary of LingDoc Transform” and in a number of position papers.

1.2     About this manual

This User Manual explains how Transform works as concisely as possible and provides immediate hands-on experience. It explains the syntax and presents a number of examples in the form of exercises, which you may modify and try out on your computer. There is one running example: the analysis and conversion of an aerospace manual into XML, together with the linguistic analysis of its text. The running example gives a comprehensive overview of the capabilities for up-conversion to XML and for language engineering tasks. Along the way references are made, by hyperlinks, to the accompanying Reference Manual. That manual gives a rather exhaustive description of the complete system at its own pace.

This User Manual is aimed at the more experienced user who already has a basic understanding of the syntax of XML DTDs, EBNF and regular expressions. We advise the impatient reader, experienced or not, to follow this User Manual and to use the hyperlinks in case something is not clear or not precise enough. We welcome any suggestions for further improvement of these manuals.

2     Working Procedure

The Transform specify-parse-adjust-convert cycle enables you to easily specify the structure of a document, parse it, adjust it and then convert it to valid XML. Here’s how it works:

1. SPECIFY GRAMMAR FOR DOCUMENTS
2. PARSE DOCUMENTS
3. ADJUST GRAMMAR AND DOCUMENTS
4. CONVERT DOCUMENTS TO XML

2.1     SPECIFY

Obtain a number of source documents together with their target documents in XML, as well as the document schema. Convert the schema, in a one-to-one correspondence, into a grammar in the W3C notation. The terminal symbols of the grammar are the characters in the source documents. Add regular expressions to fine-tune the grammar to the peculiarities of the source documents.

 

2.2     PARSE

Compile the grammar into a parser. Parse the documents and let errors be reported.

 

2.3     ADJUST

Errors may be caused by flaws in the grammar and/or by errors and inconsistencies in the documents. Edit the grammar and/or the documents. The cycle of parse-adjust may be repeated until no errors are reported.

 

2.4     CONVERT

Add output specifications to the grammar; the documents will then be converted into XML. Alternatively, whichever is easier, specify in Transform a second type of grammar which contains rewriting patterns, or use your own rewriting program. In that case you use Transform mainly as a tool for getting documents correct, using the cycle 1-2-3.

 

3     Install Transform

For the installation of Transform from the CD: follow the instructions in the file Install.txt.

4     Directory structure

Two directories will be available: Transform and Java. The directory Java will only be used by the Gui. The directory structure of Transform is as follows (a larger font indicates its relevance for the user):

 

BIN

contains the executables for the system Transform

ETC

contains a text file for error-messages, to be used by the system

Examples_more

contains additional examples, referred to by the User Manual

Examples_myself

the directory you will work with

Examples_User_Manual

initially the same as Examples_myself; in case you mess up your work you can always copy the original

GRM

contains the bootstrap grammars for the system (the formalisms used within the system are described by Transform grammars themselves)

Gui

contains the java classes for the graphical user interface

Manuals

contains the User Manual and the Reference Manual

several .bat files

for running the system from the Gui or in batch

click-for-Transform

a Windows pif file; click on it to start the Gui; you may copy this pif file to the desktop for easier use

read.me

instructions for adjusting path names; initially, there is no need to do this

5     Explore the Interface

 

Start the Gui of Transform by clicking on “click-for-Transform”. This brings you to the main menu, also called “Basic Mode”. The menu contains comments on each of the six buttons; the buttons are further explained hereafter.

5.1     Gui: Basic Mode


5.3     Gui: Select Grammar…

5.4     Gui: Compile Grammar…

5.5     Gui: Select Inputfile…

5.6     Gui: Run Grammar…

5.7     Gui: Advanced mode…

The behavior of the compiler and the run time system can be adapted to the wishes of the user by means of options, which are set as switches in the GUI. Nearly all switches may be specified and altered using the Advanced mode.

 

The switches are kept in so-called switches files, one for the compiler and one for the run time system. The switches files have the extension .SW. Each switch has a default value; if a switch value is not specified by the user, this default value is assumed.

A switches file is maintained automatically by the GUI, but can also be altered by hand.

 

Some settings of switches are related. The GUI takes care of the automatic change of the setting of a switch in case the setting of a related switch is altered.

More reading on switches.

 

Clicking in the Main menu on “Advanced mode…” will bring you to the menu “Advanced Mode”, where you can select five tab pages. In general, you can set a number of switches for the compiler and the parser, disassemble a compiled grammar, compile a lexicon and link several grammars together. Moving the cursor over one of the switches will reveal a short explanation.

For each exercise, the values of the switches are contained in the switches file, with extension .sw.

 

However, for doing the exercises in this User Manual it is not necessary to use the Advanced Mode.

The Advanced Mode is explained in more detail in the Reference Manual.

5.8     Use your favorite editor

You are encouraged to experiment with the examples in this User Manual. In doing so you will operate on text files. You may use your favorite text editor (a simple editor like WordPad is sufficient).

In Windows, use “mapoptions” to associate the extensions of the relevant input and output files of Transform to your editor. The extensions are:

-         dtd (this type could also be associated with an XML-editor)

-         grm (the grammar file)

-         tdo (the target document created by the transducer)

-         txt (the source document)

-         xml (the target document) (this type could also be associated with a browser like Internet Explorer)

A few more types of files are created by the compiler and the run time system. However, you do not have to inspect or alter them. (If you are interested: read more on files related to compilation and files related to the runtime system.)

6     Operation

The operation of the Transform system is described here. A description is given on types of files, error handling, batch operation and screen output.

7     Syntax: EBNF grammars and Regular Expressions

7.1     The formalism of Transform in a technical nutshell

The basic syntax of Transform is that of EBNF grammars, together with Regular Expressions for terminals. The symbols for EBNF are the same as for content models in XML DTD’s.

7.1.1        Recognition grammars

A grammar consists of a set of rewrite rules, which have a left hand side (“lhs”) and a right hand side (“rhs”). The set is unordered.

The basic syntax is extended in a number of ways:

-              everywhere in a rhs so-called “Actions” may be specified, which contain small programs; the programs use variables whose scope is the grammar rule. With the aid of variables one may construct the target document.

-              nonterminals may have parameters, which may be the variables used in the Actions. Parameters are used to distribute components of the source document, e.g. for the construction of XML attributes or for re-ordering purposes.

 

7.1.2        Transduction grammars

If the lhs of a rule contains one symbol, the rule is context-free; if it contains more than one symbol, the rule is context-sensitive (Chomsky type 1), or type 0 if the lhs is longer than the rhs.

Type-0 rules may be written as transduction rules. In that case a rhs becomes a pattern that, if located in the source document, will be rewritten according to the lhs.

If the compiler switch Transduce is set to true then a grammar will be treated as a transduction grammar.

Transduction rules behave completely differently from recognition rules, although the syntax is nearly the same. All transduction rules run in parallel, and the output of a rule is again subject to the matching of all rules.

Recognition rules describe precisely the sequence of characters within the source document. For each notion in the grammar its context is automatically determined by the whole grammar. The parsing of a source document with a recognition grammar may fail or succeed.

By contrast, the transduction of a source document always succeeds. This is the way XSLT and tools based on pattern matching operate.

With transduction rules you may delete, insert, replace and reorder components of the source document within context. Sets of transduction rules may be cascaded in order to provide for a number of conversion phases within one run of Transform.

Transduction rules may contain non-terminals which are, in their turn, defined by recognition rules.

Transduction grammars can be made very powerful. To get an impression: see the examples in the directories examples_more\cijfers and examples_more\chiffres on the translation of numbers into Dutch and French alphabetic representations.
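The rewrite-to-fixpoint behavior described above can be pictured with a small sketch. The Python below is only an analogy (it is not Transform, and it uses plain string patterns instead of grammar symbols): every rule's output is again subject to matching by all rules, and rewriting stops when no rule applies any more. Note that such rule sets can loop if a rule directly or indirectly reproduces its own pattern.

```python
# Illustrative sketch of fixpoint pattern rewriting, in the spirit
# of transduction rules: a rule's output is again subject to the
# matching of all rules. An analogy only, not the Transform engine.
def transduce(text, rules):
    """Apply (pattern, replacement) rules repeatedly until no
    pattern occurs anywhere in the text."""
    changed = True
    while changed:
        changed = False
        for pattern, replacement in rules:
            if pattern in text:
                text = text.replace(pattern, replacement, 1)
                changed = True
    return text

# Toy cascade: group tally marks into Roman-like numerals.
rules = [("IIIII", "V"), ("VV", "X")]
print(transduce("IIIIIIIIII", rules))  # ten I's -> X
```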

 

7.1.3        When to use recognition and when transduction grammars

It is advised to use recognition grammars in the first stage of a conversion process. With a recognition grammar you may specify the syntax of the source document on the character level. All inconsistencies and errors will be reported. In that case either the grammar or the source document will have to be adjusted.

When recognition succeeds one may proceed in two ways. The recognition grammar may be supplemented with actions and parameters in order to produce the output document. Alternatively, one may reorganize the recognition grammar into a transduction grammar. In the case of up-conversion to XML the first strategy suffices most of the time.

In case the structure of the source document is quite different from the structure of the target XML document one may first specify a document schema according to the structure of the source document and convert according to that schema. In a second stage XSLT may be used for the conversion to the target document.

In the case of up-conversion of “floating elements” into an XML mixed content model the use of recognition and transduction grammars may be combined. Floating elements escape, by definition, a rigorous syntactic description of their position in a document. A limited transduction grammar may then be used to identify floating elements in the source document and to supply them with XML tags (and attributes). This can be done in a pre- or post processing stage adjacent to the recognition stage. This will be demonstrated in example 5.

 

Not all extensions to the syntax can be used together. Read more on the allowed combinations.

 

7.2     EBNF

The meta symbols used in Transform grammars for EBNF notation are listed below, together with their names and a short description of their functions:

 

Delimiters and operators used in rules are:

 

name                meta symbol  function
rewrite sign        ::=          recognition: written between left- and right-hand side
transduction sign   <::          transduction: written between left- and right-hand side
concatenation sign  ,            concatenation of symbols (optional)
alternative sign    |            between alternatives
end of rule sign    .            end of a rule
comment signs       /* and */    start and end comment
action brackets     { and }      between these brackets actions are notated

Regular expressions are used for the description of strings of characters as well as for the grouping of symbols, like in an XML content model. Reserved symbols for regular expressions are:

 

name                     meta symbol  function
regular expression open  (            open bracket of a regular expression
negated reg. expr. open  `(           open bracket for a negated regular expression
r.e. close zero or one   )?           zero or one times the enclosed part
r.e. close one           )            one time the enclosed part
r.e. close zero or more  )*           zero or more times the enclosed part
r.e. close one or more   )+           one or more times the enclosed part

 

The comma in between symbols is optional. For readability, whitespace may be added freely. Empty rules are allowed: in that case the rhs is absent. All types of recursion are allowed (left-, middle- and right-recursion).

For examples: see grammar Ex_1.grm below.

 

7.2.1        Symbols

7.2.2        Characters

All characters (including the characters that have a meta function), will be recognized by the system as terminals if they are surrounded by single quotes, for example:

 

      Vowel ::=   'a' | 'e' | 'i' | 'o' | 'u' .

s     ::=   ' '  |  #x09 | CR.      /* space, TAB or CR */

      Quote ::=   '''' | '"'.             /* The single quote itself must be doubled. */

 

The reserved symbol "cr" or "CR" matches an end of line character.

All terminal symbols may also be represented by the numeric value of their underlying character code. A value in the range 0 to 255 represents a character from the extended ASCII set, ISO 8859/1. (In the future, Transform will support wide-range character sets.) The values are written in hexadecimal notation, like #x61.

In summary, there are alternative ways to represent a single character in a grammar:

character                    quoted form  hexadecimal form
"name characters" like "a"   'a'          #x61
"single quote" character     ''''         #x27
"end of line" character      CR or cr     #x0D
other characters like "."    '.'          #x2E

 

The matching of characters is by default case sensitive. This may be suppressed by a compiler switch.
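The hexadecimal forms above denote ordinary ISO 8859/1 code points, so they can be verified in any programming language. A quick check in Python (an illustration only, not part of Transform):

```python
# The grammar notations #x61, #x27, #x0D and #x2E denote the
# ISO 8859/1 code points of 'a', the single quote, the carriage
# return and the full stop.
for code, char in [(0x61, 'a'), (0x27, "'"), (0x0D, '\r'), (0x2E, '.')]:
    assert chr(code) == char
print("all four codes check out")
```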

 

7.2.3        Strings

Sequences of terminal symbols may be given as a string between single quotes. For example:

      abcString   ::=   'abc' .

is another notation for:

      abcString   ::=   'a', 'b', 'c' .

or for:

      abcString   ::=   'a' 'b' 'c' .

 

7.2.4        Ranges of characters

A range of characters is specified like in

      Consonant   ::=   [b-d] | [f-h] | [j-n] | [p-t] | [v-x] | 'z' .

      Brackets    ::=   [#x28-#x29] .     /* "(" and ")" */

The negated variants are also supported: [^a], [^a-z], [^#xN] and [^#xN-#xN].

 

Note: the syntax [0-9a-z] is currently not allowed. It has to be written as [0-9] | [a-z].

 

7.2.5        Wildcards

Wildcards are meta symbols which match arbitrary characters (including carriage return, form feed etc.). There are three different wildcards, with different functions: a dontcare, a line and an arb symbol. They are listed below, together with their functions.

 

symbol  name      function
*       dontcare  matches exactly one arbitrary character
=       arb       matches zero or more arbitrary characters, not including the character following the arb symbol in the grammar
-       line      matches any sequence of zero or more arbitrary characters, including occurrences of the character following the line symbol in the grammar

 

For example:

      Password    ::=   *, *, *, *, *, * .

defines a password as a sequence of six arbitrary characters. And in:

      Comment     ::=   '/*',  =,  '*/' .

a sequence of characters which starts with ‘/*’ and ends with ‘*/’ is recognized as a comment. The characters which are matched by the arb symbol will not contain a ‘*’. An alternative definition of comment:

      Comment     ::=   '/*',  -,  '*/' .

defines the same sequence of characters, but the characters which are matched by the line symbol may contain one or more ‘*/’. For example:

      /* a comment is marked by a "*/" at the end */

will be matched twice by the rule with a line symbol. It will only be matched once by the rule with an arb symbol.
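The difference between arb and line can be approximated with regular expressions, shown here in Python. This is an analogy, not Transform syntax: the arb symbol corresponds roughly to a character class that excludes the first character of the marker that follows it, while the line symbol allows the match to run on to a later occurrence of that marker.

```python
import re

text = '/* a comment is marked by a "*/" at the end */'

# arb-like match: the characters in between may not contain '*',
# the first character of the closing marker, so the match stops
# at the first "*/".
arb_match = re.search(r'/\*[^*]*\*/', text).group()

# line-like match: the characters in between may themselves contain
# "*/", so the match can also extend to the final "*/" (the greedy
# regex picks the longest of the possible spans).
line_match = re.search(r'/\*.*\*/', text).group()

print(arb_match)
print(line_match)
```

The rule with a line symbol thus admits two matches for this string (the short span and the whole line), which is why the manual says it matches twice, whereas the arb rule admits only the short match.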

 

7.2.6        Non-terminal symbols

Non-terminal symbols are symbols that occur at least once on the lhs of a context free rule. Their names consist of a sequence of one or more alphabetic characters (uppercase or lowercase, case sensitive), the digits 0 to 9 and the underscore character ("_"). The maximum length of a non-terminal name is 32 characters.

7.2.7        Regular expressions

The components of regular expressions may be terminals as well as nonterminals. Nesting of regular expressions may be of any depth. Regular expressions cannot be used on the left-hand side of grammar rules.

7.3     More reading

8     A running example

In the sequel we will gradually build up a complete example of the conversion of an aerospace manual.

The examples are in directory “Examples_User_Manual” and are called Ex_1, Ex_2 and so on. The corresponding filenames have self-evident extensions. The complete example is in exercise 4. Subsequent examples contain variations.

For each example we give, if applicable, an explanation, the source document, the XML DTD, the Transform grammar and the target document in XML.

Within the examples the following elements and attributes are used:

 

Element  Element description          Attributes  Attribute description
LI       list item
MAT      material (floating element)  MREF        reference to the material
NOTE     note (floating element)
RT       refer to task                RID, PAGE   reference to a task and to a page number
STID     subtask identification       ID          identification of a subtask
STT      subtask title                LTR         sequence indicator (A, B etc)
SUBTASK  subtask
TASK     task
TID      task identification          ID          identification of a task
TT       task title
WAR      warning (floating element)

 

The first example serves as a kind of “Hello world” example. A tiny fragment of the source document will be used to give a quick hands-on experience. The target document only contains start- and end tags.

9     Example 1: getting started

This example shows the first two lines of our example document. They are in file Ex_1.txt in directory Ex_1.

9.1     Source document (Ex_1.txt)

TASK

Remove Hatch Seals

 

9.2     DTD (Ex_1.dtd)

The DTD of the target document is contained in file Ex_1.dtd:

 

<?xml version="1.0" encoding="UTF-8"?>

<!ELEMENT TASK (TT)>

<!ELEMENT TT (#PCDATA)>

9.3     Grammar (Ex_1.grm)

The grammar which describes Ex_1.txt is in file Ex_1.grm. For training purposes it is of greater detail than necessary to identify the #PCDATA: the grammar also specifies the characters, words and whitespace within the #PCDATA.

 

Starting with Ex_1.dtd you may arrive at Ex_1.grm by the following steps (which will be augmented in chapter 15):

-         rewrite the DTD in Transform grammar notation:

TASK  ::= TT .

TT    ::= PCDATA.

-         add terminal symbols which are associated with the start and the end of the elements

TASK  ::= 'TASK', TT.

-         refine PCDATA

PCDATA ::= token, ( whitespace, token )*.

-         refine token

token ::= ( char )+.

-         add allowable whitespace, like in:

TASK  ::= 'TASK', whitespace, TT.

-         add regular expressions for char and whitespace.

whitespace ::= ( s )+.

char  ::= [#x00-#x08] | [#x0A-#x0C] | [#x0E-#x1F] | [#x21-#x7F].

      /* all characters, except space, TAB and CR */

s     ::= ' ' | #x09 | CR. /* space, TAB or CR */

 

Format the completed grammar in an easy-to-read style, for instance like:

 

TASK ::=

      'TASK'

      whitespace

      TT.

 

TT ::=

      PCDATA.

 

/* symbols in document: */

 

PCDATA ::=

      token

      ( whitespace

            token

      )*.

 

token ::=

      ( char )+.

 

whitespace ::=

      ( s )+.

 

char ::=

      `(' ' | #x09 | #x0D).   /* all characters, except space, TAB and CR */

 

s ::=

      ' ' | #x09 | CR. /* space, TAB or CR */
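For this small grammar, the accepted language can also be paraphrased as a regular expression. The Python check below is an illustrative stand-in for the compiled parser, not part of Transform; as an assumption it also accepts LF where the grammar says CR, since text files on disk may use either line ending.

```python
import re

# Regular-expression paraphrase of Ex_1.grm: 'TASK', whitespace,
# then tokens separated by whitespace (an optional trailing
# newline is allowed at the end of the file).
s          = r'[ \t\r\n]'            # space, TAB or CR (LF added)
char       = r'[^ \t\r\n]'           # every other character
whitespace = s + '+'
token      = char + '+'
pcdata     = f'{token}(?:{whitespace}{token})*'
task       = f'TASK{whitespace}{pcdata}(?:{whitespace})?'

source = 'TASK\nRemove Hatch Seals\n'
print(bool(re.fullmatch(task, source)))  # True
```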

9.4     Compile and parse Ex_1

We will now run the compiler and the parser.

-         In the GUI press “Select grammar…” and select the file Ex_1.grm. The selected filename will show up in the status bar on top of the GUI. Transform will memorize the filename in between sessions until another filename is selected.

-         press “Compile grammar…”. A command box pops up and a number of messages appear about the deletion of intermediary files. Then the current values of switches for the compiler are shown. If everything goes well you get the message: “# Command completed, exit status = 0”. Press the button “Close” at the bottom of the command box.
If there are errors in the grammar they will be shown in the command box as well as in the file Ex_1.err.

-         press “Select input file…” and select the source document Ex_1.txt. The selected filename will also show up in the status bar on top of the GUI. Close the command box.

-         press “Run grammar”. Again a command box shows up. If everything goes well you get the messages “Input correct.” and “exit status = 0”. If not, error-messages are given.

-         you are finished. Because the grammar contains no output-instructions no target document will be created.

9.5     Experiment

You may now start your first experiments. Change the grammar and the source document in a controlled way and see what is happening. If there are errors in the grammar or in the source document Transform will try to recover and to report as many errors as possible.

10              Syntax: extensions for output, variables and expressions

10.1 Introduction

Actions relate to extensions of the formalism. The main types of actions are:

(1)        messages

(2)        assignments and tests

The general rules for Actions are:

-         Actions are written between braces "{ ... }".

-         Apart from assignments and tests, an action starts with a keyword determining the type of the action. These keywords are "r:" and "s:" and may be written in upper or lowercase without difference in meaning.

-         Actions may be conjoined by the conjunction operator (semicolon ";" or ampersand "&").

-         Actions may only be used on the right-hand side of a context free rule. They may occur:

o       after each symbol or regular expression

o       between the ")" and the "*" or "+" of a regular expression

o       as the first (or only) element of an alternative.


For some types of actions the effects depend on their position in the rule.

More reading on Actions.

11              Syntax: Output

11.1 Introduction

The target XML document may be created by the use of so-called “messages”. Messages mark the relevant points in the recognition process, by which other (external) processes can be triggered; they provide for event-based processing in general-purpose programming languages, in the style of the SAX APIs.

An extensive description of messages is given here. For the creation of a target XML document the following description suffices.

 

When messages are used in a grammar, an output file with the extension .REP (also called "report file") is created in which the messages are written. (If the numbers of the reports are suppressed then the extension will be .XML.) There are two variants: simple messages and messages with variables.

 

11.1.1    Simple messages

 

A message is always written within an action as:

{ r: <number> }.

 

Reports or r‑messages report the number followed by a colon plus the most recently matched terminal symbol. In the case of subsequent r‑messages with the same report number, the report number is given once and the reported symbols are concatenated (even if they were not contiguous in the input). This mechanism unifies the creation of either a target XML document or a file containing markers for event-based processing.

 

R‑messages always output the character that has most recently been read. Thus, if a report follows a symbol or regular expression that covers no input (as may be the case in "( )?" and "( )*" expressions), the character that was read before the regular expression will be output.

 

In the grammar:

tokens      ::=   token, ( separator, token )*.

token       ::=   ( char {R:1})+.

char        ::=   [a-z] .

separator   ::=   ' '  |  #x09.     /* space or TAB */

 

any sequence of lowercase alphabetic characters will be recognized as a token and will return as output a report file with one line. It consists of the report number "1" followed by a colon and all recognized characters, since subsequently reported terminal symbols with the same number are concatenated.

 

For example, for the string "abc   def " the output will be:

            1:abcdef

The separator “space” will not be reported.
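The effect of the r-message in this grammar can be mimicked in a few lines of Python (a simulation of the reporting behavior, not the Transform runtime): every recognized [a-z] character is reported under number 1, and consecutive reports with the same number are concatenated onto one line.

```python
# Simulate {R:1} on char in the grammar above: report every matched
# lowercase letter under number 1; same-numbered reports concatenate.
def report(source):
    reported = [c for c in source if 'a' <= c <= 'z']
    return '1:' + ''.join(reported)

print(report('abc   def '))  # 1:abcdef
```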

 

With the runtime switch CASE_CONVERT, with values LOWER_TO_UPPER and UPPER_TO_LOWER, one may convert the case of the input.

 

11.1.2    Messages with variables

 

A variant of the simple r‑messages can be used to communicate the value of expressions with variables. In this use, the report or signal number is again followed by a colon and then an expression:

 

            {r:<number>:<expression>}.

 

The use of variables and expressions will be described in the sequel.

 

The effect of this action is that as soon as the input is recognized up to the point of the action, the current value of the expression will be written to the report file, preceded by the number and a colon. The same rule for concatenation is followed as for simple messages.

 

Expressions may contain literal string values. Therefore, the action

      {R:1:'<TASK>'}

will write the tag <TASK>.

 

The next example (2) is an extension of example 1: messages are added and a target XML document is created. A string of whitespace characters is replaced by a single blank.

11.2 More reading

12              Example 2: Output

12.1 Source document (Ex_2.txt, same as Ex_1.txt)

TASK

Remove Hatch Seals

12.2 DTD (Ex_2.dtd, same as Ex_1.dtd)

<?xml version="1.0" encoding="UTF-8"?>

<!ELEMENT TASK (TT)>

<!ELEMENT TT (#PCDATA)>

12.3 Grammar (Ex_2.grm, as an extension to Ex_1.grm)

TASK ::=

      'TASK'

      {R:1:'<TASK>'}

      whitespace

      {R:1:'<TT>'}

      TT

      {R:1:'</TT></TASK>'}.

 

TT ::=

      PCDATA.

 

/* symbols in document: */

 

PCDATA ::=

      token

      ( whitespace

      {R:1:' '}

      token

      )*.

 

token ::=

      ( char {R:1})+.

 

whitespace ::=

      ( s )+.

 

char ::=

      `(' ' | #x09 | #x0D).   /* all characters, except space, TAB and CR */

 

s ::=

      ' ' | #x09 | CR. /* space, TAB or CR */

 

12.4 Target document (Ex_2.xml)

<TASK>

  <TT>Remove Hatch Seals</TT>

</TASK>
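For intuition only, the effect of the Ex_2 grammar can be approximated in a few lines of Python. This is an assumed simulation; the real conversion is performed by the compiled grammar, and the report file is not pretty-printed.

```python
import re

def convert_task(source):
    # 'TASK', then whitespace, then the title as PCDATA;
    # every whitespace run inside the title becomes one blank ({R:1:' '})
    m = re.match(r"TASK\s+(.*\S)", source, re.S)
    title = re.sub(r"\s+", " ", m.group(1))
    return "<TASK><TT>" + title + "</TT></TASK>"

print(convert_task("TASK\n\nRemove  Hatch Seals\n"))
# -> <TASK><TT>Remove Hatch Seals</TT></TASK>
```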

13              Syntax: Variables and expressions

13.1 Introduction

In Transform grammars, two kinds of variables can be used: local variables and parameters. The main difference between the two types lies in their scope: the scope of a local variable is the rule in which it occurs, whereas a parameter can be used to transmit information between different rules. All operations that can be performed on local variables are applicable to variables introduced as parameters as well.

13.2 Variables, literals and expressions

Every context-free (cf) rule, except context sensitive and transduction rules, may be extended with local variables. Information (in the form of a string) available at some point in the rule can be stored in a variable so that it can be used at another point in the same rule. A variable has a name and a value (which is always a string).

A local variable need not be declared; the variable will be allocated at the moment of its first reference in the rule. At the moment of its allocation, it contains the empty string and this remains its content up to the moment that another value is assigned to it.

There is a reserved variable name (percent "%"). Its value is always the character that has most recently been read.

Besides variables, literal string values (also called literals) can be used in actions. Literals are strings of characters between single quotes. The empty string can be represented by merely two quotes. To represent a single quote as part of a literal, it should be written twice.

 

Expressions can be built from variables and literals. This is done with the concatenation operator "||" or “+” (equivalent). The value of an expression is the concatenated value of all its components. (The term "expression" refers also to its most simple form: a simple variable or a simple literal.)

 

As an example, we give the value of some expressions that can be composed from the following components:

            var1     with the value    "aa"

            var2     with the value    "bb"

            %         with the value    'c'

                        the literal           'dd'

 

expression              resulting value

var1 || var2            "aabb"

var1 || %               "aac"

'dd' || var2            "ddbb"

var1 || % || var1       "aacaa"

13.3 Assignment

A variable may be assigned a value by means of the assignment operator ("="); in general:

      { <variable> = <expression> }.

The value of a variable can be set in different ways, for example:

      {var1=var2} or {var1=%}       /* assignment of a variable */

      {var1='abc'}                  /* assignment of a literal */

      {var1=var2||'abc'}            /* assignment of an expression */

 

The value of a variable may be empty at any time. Remember that a variable has an empty value at the moment of its first appearance in a rule; its value can become the empty string again by assigning the literal representing the empty string to it. Therefore:

      {var=''}

removes the value of the variable "var".

Explicit assignment to the reserved "%" is not possible; this variable is assigned its value, the last read character, in run time.
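The behaviour of local variables can be modelled as a table of strings that default to the empty string. The sketch below is an illustration in Python, not Transform syntax; the name last_char stands in for the reserved "%".

```python
from collections import defaultdict

variables = defaultdict(str)     # first reference yields the empty string
last_char = "c"                  # plays the role of the reserved '%'

variables["var1"] = "aa"         # {var1='aa'}
variables["var2"] = "bb"         # {var2='bb'}

# {var3 = var1 || % || var1}
variables["var3"] = variables["var1"] + last_char + variables["var1"]
print(variables["var3"])         # -> aacaa

variables["var3"] = ""           # {var3=''} removes the value again
```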

13.4 Tests on expressions

13.4.1    Introduction

A powerful mechanism is the application of tests: the recognition of input can be made dependent on a testable criterion. This feature can be used successfully in grammars that describe ambiguous phenomena and can limit the number of alternative parse trees.

13.4.2    Types of tests

Tests are performed on variables and literals, which can be combined into expressions.

Tests concern the truth value of a comparison between two expressions. If this value is false, the current recognition path stops. There are two test operators, one for equality and one for inequality.

The equality test:

      { <expression1> == <expression2> }

returns the value true if "expression1" has the same value as "expression2", while the inequality test:

      { <expression1> != <expression2> }

returns the value true if "expression1" refers to a value other than that specified by "expression2".

Disjunct tests may be stated by means of the "or"-operator:

      { <test_1> | <test_2> | ... | <test_n> }

is true when the evaluation of at least one of the tests leads to the value true. For example:

      { var1== '' | var2==% }

is true if the value of the variable "var1" is empty or if "var2" has the value of the last read character in the input.
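Seen as ordinary boolean logic, the disjunct test above behaves like the following Python predicate (an illustrative rendering; last_char again stands for "%"):

```python
def disjunct_test(var1, var2, last_char):
    # { var1 == '' | var2 == % } : true if at least one comparison holds;
    # a false test would stop the current recognition path
    return var1 == "" or var2 == last_char

print(disjunct_test("", "x", "c"))    # -> True  (first test holds)
print(disjunct_test("aa", "c", "c"))  # -> True  (second test holds)
print(disjunct_test("aa", "b", "c"))  # -> False (recognition path stops)
```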

13.4.3    Combinations of tests and/or assignments

Multiple assignments may be conjoined by the conjunction operator (";" or "&"):

      {var1=expr1;  var2=expr2}     or    {var1=expr1 & var2=expr2}

Multiple tests may also be conjoined by the conjunction operator:

      {var1==expr1 & var2!=expr2}   or    {var1==expr1;  var2!=expr2}

is true when all of the tests are true.

The two operators behave identically. It is customary to separate assignments by the “;” and to separate tests by the “&” operator.

13.5 Parameters

A non-terminal symbol may be supplied with parameters, which can be viewed as attributes of the symbol (as in attribute grammars). They are declared together in a so-called formal parameter list of a non-terminal symbol on the left-hand side of a context free rule. There are three types of parameters:

 

i  (input)         : taking (inheriting)

o  (output)        : giving (synthesizing)

io (input/output)  : taking and giving

 

The keywords "i", "o" and "io" are case insensitive.

 

Once declared, the type of the parameters is fixed. The type of the parameter must be specified in the formal parameter list, followed by a colon, followed by one or more names of parameters, separated by a comma. A parameter list can contain parameters of one, two or three types. There is no prescribed order if more than one type is used. Each type indicator is separated from the preceding list of names by a comma.

 

Examples of formal parameter lists are:

 

      Verb(I:MainVerb)  ::= ...

            Symbol "Verb" has one parameter of which the value is determined in the calling rule; the value of "MainVerb" is not affected after application of the present rule.

 

      VsMod(O:type)     ::= ...

            Symbol "VsMod" has one parameter of which the value is initially an empty string; the value of "type" is determined during the application of the present rule.

 

      Noun(IO:cat)      ::= ...

            Symbol "Noun" has one parameter of which the value is initially determined by the calling rule; the value of "cat" may be changed after application of the present rule.

 

      RelClause(I:Antecedent,O:type,IO:cat)     ::= ...

            Symbol "RelClause" has three parameters, one of each type.

 

      NP(O:cat,number,pers)   ::= ...

            Symbol "NP" has three parameters, all of the same type.

 

When the non-terminal is used on the right-hand side of one or more rules, the parameters must be repeated in a so-called actual parameter list. The names of the actual parameters can of course be different from the names used in the formal parameter list. In the actual parameter list, it is allowed to repeat the type but it is not necessary. If the type is indicated on the right-hand side, it must correspond to the type within the declaration. So, one can choose between, for instance:

 

 ...  NP(cat, number, pers)  ...          and ...  NP(O:cat, number, pers)  ...

 

The use of parameters enables the exchange of information between rules. For example, number agreement of subject and verb can be handled with the use of parameters in combination with a test action:

 

      Sentence          ::=   NP(number1),  VP(number2) {number1==number2} .

      NP(O:number)      ::=   ... .

      VP(O:number)      ::=   ... .
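The agreement check can be paraphrased in Python with an assumed toy lexicon (the lexicon entries are invented for illustration; only the final comparison corresponds to the test action {number1==number2}):

```python
# hypothetical lexicon: each phrase delivers its number via an O: parameter
NP_NUMBER = {"the man": "sg", "the men": "pl"}
VP_NUMBER = {"walks": "sg", "walk": "pl"}

def sentence_ok(np, vp):
    # Sentence ::= NP(number1), VP(number2) {number1==number2}
    return NP_NUMBER[np] == VP_NUMBER[vp]

print(sentence_ok("the man", "walks"))  # -> True
print(sentence_ok("the man", "walk"))   # -> False
```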

 

The starting symbol can also have output parameters (but no input or input/output ones). The content of these top level parameters can be written to the report file and/or the screen with the run time switch OUTPUT_PARAMETERS.

 

The same operations that are allowed on variables, like assignments and tests, are allowed on parameters. In a rule with a formal parameter list, local variables can be introduced as well.

 

            Recommendation: The compiler distinguishes parameters from local variables by examining the formal parameter list. Thus, when a parameter name is misspelled (due to a typing error), it is silently considered to be a local variable. Therefore, it is advisable to set the compiler switch CHECK_VARIABLES to TRUE especially in early stages of the development of a grammar. Then, for every rule in the grammar, the local variables will be written to the compiler error file.

            Restriction: Parameters can only be used if FINITE=FALSE. The reason is that, if FINITE=TRUE, the compiler will try to create an FSM, which has no memory for variables.

13.6 More reading

The next example demonstrates the use of variables.

14              Example 3: Variables

14.1 Source document (Ex_3.txt)

TASK t52-43-01-000-800

14.2 DTD (Ex_3.dtd)

<!ELEMENT TASK (TID)>

<!ELEMENT TID EMPTY>

<!ATTLIST TID

  ID CDATA #REQUIRED

> 

14.3 Grammar (Ex_3.grm)

TASK ::=   

      'TASK'

      whitespace

      't'

      TID(I:ID)

      {R:1:'<TID ID="' || ID || '"/>'}.

 

TID(O:ID) ::=

      number(I:N) {ID = ID||N}

      ( '-' {ID = ID||%}            /* ‘%’ denotes the last read character, which is

‘-‘. Its value is concatenated to ID */

        number(I:N) {ID = ID||N}    /* ‘number’ delivers the value of ‘N’, which

is concatenated to ‘ID’ */

      )*.

 

number(O:N) ::=                     /* the value of N is constructed by

concatenating all individual digits */

      ( d {N = N||% })+.

 

d ::=

      [#x30-#x39].

 

whitespace ::=

      ( #x09 | #x20 | CR )+ .

 

14.4 Target document (Ex_3.xml)

<TID ID="52-43-01-000-800"/>
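What the rules TID and number compute can be checked with a small Python equivalent (a sketch under the assumption that the identification always looks like "t" followed by digit groups separated by hyphens):

```python
import re

def tid_markup(source):
    # mirrors TID(O:ID): the digits and hyphens after 't' are
    # concatenated into ID, which is then reported inside the tag
    m = re.search(r"t(\d+(?:-\d+)*)", source)
    return '<TID ID="' + m.group(1) + '"/>'

print(tid_markup("TASK t52-43-01-000-800"))
# -> <TID ID="52-43-01-000-800"/>
```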

15              Methodology for up-conversion: from DTD to grammar

15.1 Introduction

After the introduction of the main parts of the formalism of Transform, it is now time to show the complete example on the up-conversion of the aerospace manual. We follow a general working procedure, which is a refinement of the working procedure already outlined in the position paper on up-conversion to XML.

15.2 SPECIFY

-  Obtain a number of source documents together with their target documents in XML.

-  Obtain the document schema, preferably in 2 notations:

   o  an XML schema notation with restrictions on types

   o  the DTD notation in order to reuse the content models (many XML tools can generate a DTD from an XML schema).

-  Convert the DTD, in a one-to-one correspondence, into a grammar in the Transform formalism.

   o  Rewrite an element definition into a grammar rule:

      §  the element name becomes the lhs of the rule

      §  the content model becomes the rhs of the rule

      e.g.

      <!ELEMENT SUBTASK (STID, STT, NOTE?, WAR?, LI+)>

      becomes

      SUBTASK ::= STID, STT, (NOTE)?, (WAR)?, (LI)+.

      (Note that single symbols which are optional or repeatable have to be surrounded by parentheses.)

   o  Rewrite the attributes in an attribute definition as parameters to the nonterminal which replaces the element name.

      e.g.

      <!ELEMENT STT (#PCDATA)>

      <!ATTLIST STT

      LTR CDATA #REQUIRED>

      becomes

      STT(O:LTR) ::= PCDATA {….LTR = …}.

-  Observe the documents and determine in which manner they contain material that identifies the start and end of elements and the appearance of attributes and entities.

   In general there are 3 types of information:

   o  A. strings which mark the structure of elements

   o  B. strings which are part of #PCDATA

   o  C. strings which are purely used for presentation, like whitespace.

   Human intervention in the conversion process will be necessary if

   o  some XML markup cannot be identified in the documents

   o  the same identifying material is used for different markup, and there is insufficient context for disambiguation.

-  Add regular expressions to the grammar which describe all variants of the strings of type A, B and C.

   Be precise in the specification of strings of type C; e.g. the following rules are ambiguous:

   P ::= Q, R.

   Q ::= X, whitespace.

   R ::= whitespace, Y.

   whitespace ::= ( space | tab | cr )+.

   This ambiguity can easily be avoided by following the rule that either:

   o  no grammar rule starts with strings of type C

   o  or no grammar rule ends with strings of type C

-  If an XML schema is available, add regular expressions to the grammar which account for the restrictions on types in the schema.
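The one-to-one rewrite of a content model into a rule's right-hand side is mechanical; as a sketch (assuming simple content models without nested groups), it can even be automated:

```python
import re

def content_model_to_rhs(model):
    # wrap single symbols that carry an occurrence indicator in parentheses,
    # as required by the Transform formalism
    return re.sub(r"\b(\w+)([?+*])", r"(\1)\2", model)

rhs = content_model_to_rhs("STID, STT, NOTE?, WAR?, LI+")
print("SUBTASK ::= " + rhs + ".")
# -> SUBTASK ::= STID, STT, (NOTE)?, (WAR)?, (LI)+.
```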

15.3 PARSE

Compile the grammar into a parser. Parse the documents and let errors be reported.

15.4 ADJUST

Errors may be caused by flaws in the grammar and/or by errors and inconsistencies in the documents. Edit the grammar and/or the documents. The cycle of parse-adjust may be repeated until no errors are reported.

15.5 CONVERT

Add output-specifications to the grammar. The documents will be converted into XML.

16              Example 4: The complete conversion

16.1 Introduction

This example shows the conversion of the complete aerospace manual. Note the correspondence between the DTD and the grammar.

16.2 Source document

 

TASK t52-43-01-000-800

Remove Hatch Seals

 

SUBTASK  s52-43-01-010-001-001

A.    Get access

      (1)   Open the hatches and remove them

             (refer to TASK t52-43-00-000-800 / page book 401).

      (2)   Put the hatches down on a soft mat.

 

SUBTASK s52-43-01-020-001-001

B.    Remove the hatch seals.

      NOTE On each hatch there are 42 screws.

      (1)   Remove the screws.

(2)   Remove the retainer strips.

 

SUBTASK  s52-43-01-350-001

C.    Remove corrosion

      (1)   Remove any corrosion with abrasive

            cloth (material No. Fk05-024).

 

SUBTASK  s52-43-01-110-001

D.    Clean the hatch

      WARNING BE CAREFUL WHEN YOU USE THE

      METHYLETHYLKETONE (MEK). IF YOU GET MEK IN YOUR EYES:

      -     FLUSH YOUR EYES WITH WATER

      -     GET MEDICAL HELP.

      THE MEK IS POISONOUS.

(1)   Clean the areas with MEK (material No. Fk11-003).

 

16.3 DTD

For a quick reference we repeat the table with element descriptions.

 

Element   Element description           Attributes   Attribute description

LI        list item

MAT       material (floating element)   MREF         reference to the material

NOTE      note (floating element)

RT        refer to task                 RID, PAGE    reference to a task and to a page number

STID      subtask identification        ID           identification of a subtask

STT       subtask title                 LTR          sequence indicator (A, B etc)

SUBTASK   subtask

TASK      task

TID       task identification           ID           identification of a task

TT        task title

WAR       warning (floating element)

 

 

 

<?xml version="1.0" encoding="UTF-8"?>

<!ELEMENT TASK (TID, TT, SUBTASK+)>

<!ELEMENT TID EMPTY>

<!ATTLIST TID

  ID CDATA #REQUIRED

> 

<!ELEMENT TT (#PCDATA)>

<!ELEMENT SUBTASK (STID, STT, NOTE?, WAR?, LI+)>     

<!ELEMENT STID EMPTY>

<!ATTLIST STID

  ID CDATA #REQUIRED

> 

<!ELEMENT STT (#PCDATA)>

<!ATTLIST STT

  LTR CDATA #REQUIRED

> 

<!ELEMENT NOTE (#PCDATA)>

<!ELEMENT WAR (#PCDATA | LI)*>

<!ELEMENT LI (#PCDATA | RT | MAT)*>

<!ELEMENT MAT EMPTY>

<!ATTLIST MAT

  MREF CDATA #REQUIRED

> 

<!ELEMENT RT EMPTY>

<!ATTLIST RT

  RID CDATA #REQUIRED

  PAGE CDATA #REQUIRED

> 

 

16.4 Target document in XML

 

<!DOCTYPE TASK SYSTEM "F.dtd">

<TASK>

  <TID ID="t52-43-01-000-800"/>

  <TT>Remove Hatch Seals</TT>

  <SUBTASK>

      <STID ID="s52-43-01-010-001-001"/>

      <STT LTR="A">Get access</STT>

      <LI>Open the hatches and remove them <RT RID="t52-43-00-000-800" PAGE="401"/>.</LI>

      <LI>Put the hatches down on a soft mat.</LI>

  </SUBTASK>

  <SUBTASK>

      <STID ID="s52-43-01-020-001-001"/>

      <STT LTR="B">Remove the hatch seals.</STT>

      <NOTE>On each hatch there are 42 screws.</NOTE>

      <LI>Remove the screws.</LI>

      <LI>Remove the retainer strips.</LI>

  </SUBTASK>

  <SUBTASK>

      <STID ID="s52-43-01-350-001"/>

      <STT LTR="C">Remove corrosion</STT>

      <LI>Remove any corrosion with abrasive cloth<MAT MREF="Fk05-024"/>.</LI>

  </SUBTASK>

  <SUBTASK>

      <STID ID="s52-43-01-110-001"/>

      <STT LTR="D">Clean the hatch</STT>

      <WAR>BE CAREFUL WHEN YOU USE THE METHYLETHYLKETONE (MEK). IF YOU GET MEK IN YOUR EYES:

        <LI>FLUSH YOUR EYES WITH WATER</LI>

        <LI>GET MEDICAL HELP.</LI>

         THE MEK IS POISONOUS.

      </WAR>

      <LI>Clean the areas with MEK<MAT MREF="Fk11-003"/>.</LI>

  </SUBTASK>

</TASK>

 

16.5 Grammar

 

TASK  ::=

      'TASK'

      {R:1:'<TASK>'}

      whitespace

      TID(I:ID)

      {R:1:'<TID ID="' || ID || '"/>'}

      whitespace

      {R:1:'<TT>'}

      TT

      {R:1:'</TT>'}

      ( whitespace

        {R:1:'<SUBTASK>'}

        SUBTASK

        {R:1:'</SUBTASK>'}

      )+

      ( whitespace )?

      {R:1:'</TASK>'} .

 

TID(O:ID) ::=

      't' {ID = %}

      Numbers_with_hyphens(I:ID1)

      {ID = ID || ID1}.

 

TT ::=

     PCDATA.

 

SUBTASK ::=

      'SUBTASK'

      separators

      STID(I:ID)

      {R:1:'<STID ID="' || ID || '"/>'}

      whitespace

      STT(I:LTR)

      ( whitespace

        {R:1:'<NOTE>'}

        NOTE

        {R:1:'</NOTE>'}

      )?

      ( whitespace

        {R:1:'<WAR>'}

        WAR

        {R:1:'</WAR>'}

      )?

      ( whitespace

        STLI

      )+.

 

STID(O:ID) ::=

      's' {ID = %}

      Numbers_with_hyphens(I:ID1)

      {ID = ID || ID1}.

 

NOTE  ::=

      'NOTE'

      ( separators

        no_hyphen_parenleft {R:1}

        PCDATA_till_CR

      )+.

 

WAR   ::=

      'WARNING'

      ( separators

        no_hyphen_parenleft {R:1}

        PCDATA_till_CR

        |

        WARLI

      )+.

 

WARLI ::=

      ( separators )?

      '-'

      separators

      {R:1:'<LI>'}

      PCDATA_till_CR

      {R:1:'</LI>'}.

 

STLI  ::=

      '('

      number(I:N)

      ')'

      separators

      {R:1:'<LI>'}

      MIXED_CONTENT

      {R:1:'</LI>'}.

 

STT(O:LTR) ::=

      capital {LTR=%}

      ('.')?

      separators

      {R:1:'<STT LTR="' || LTR || '">'}  

      PCDATA_till_CR

      {R:1:'</STT>'}.  

 

MIXED_CONTENT ::=

      (

        ( no_parenleft {R:1} )+

        | MAT

        | RT

      )+.

 

MAT   ::=  

      '('

      ('M' | 'm')

      'aterial No. Fk'

      Numbers_with_hyphens(I:mref)

      ')'

      {R:1:'<MAT MREF="Fk' || mref || '"/>'} .

 

RT    ::=

      '(refer to TASK '

      't'

      Numbers_with_hyphens(I:rid)

      ' / page book '

      number(I:page)

      ')'

      {R:1:'<RT RID="t' || rid || '" PAGE="' || page || '"/>'} .

 

d     ::=

      [#x30-#x39].  /* name "d" as in XML Schema */

 

Numbers_with_hyphens(O:ID) ::=

      number(I:N) {ID = ID || N}

      ( '-' {ID = ID || %}

        number(I:N) {ID = ID || N}

      )*.

 

number(O:N) ::=

      ( d {N = N || % })+.

 

PCDATA ::=

      tokens

      (

        ( separators )?

        CReturn

        ( separators )?

        tokens

      )*.

 

PCDATA_till_CR ::=

      tokens

      ( separators )?

      CReturn.

 

tokens ::=

      token

      ( separators {R:1:' '}

        token

      )*.

 

separators ::=

      ( separator )+ .

 

s ::=

      separator

      | CReturn.  /* name "s" as in XML Schema */

 

whitespace ::=

      ( s )+.

 

token ::=

      ( char {R:1})+.

 

separator ::=

      ' '

      | #x09. /* space or TAB */

 

capital ::=

      [#x41-#x5A].

 

char ::=

      `(' ' | #x09 | #x0D). /* no space, TAB or CR */

 

no_parenleft ::=

      `('(' | #x0D). /* no '(' or CR */

 

no_hyphen_parenleft ::=

      `(' ' | #x09 | '-' | '('). /* no space, TAB, '-' or '(' */

 

CReturn ::= CR {R:1} .

 

17              Syntax: Extensions of context-sensitive grammars

17.1 Introduction

A context sensitive rule (technically called a Chomsky type-1 or type-0 rule) has two or more symbols at the left-hand side. The preferred rewrite sign between the two sides is "<::". The left-hand side of such a rule may never be empty. In context sensitive rules ordinary terminals, like 'a' or #x0E or '.', can be used at both sides. The two sides can also contain non-terminals and intermediates; their use will be explained later.

Context-sensitive grammars are mainly used for linguistic purposes. Therefore, we concentrate on these purposes. (More reading). A variant of context-sensitive grammars are transduction grammars, which we will discuss in the next chapter.

17.2 Example

In the next example, the input "in the garden the man sees a bird" as well as "the man sees a bird in the garden" will be recognized as the starting symbol "Sentence".

 

      Sentence                ::=   Prep,  Subj,  Verb,  Obj .

/* a context-free rule */

      Prep                    ::=   'in the garden ' .

      Subj                    ::=   'the man ' .

      Verb                    ::=   'sees ' .

      Obj                     ::=   'a bird ' .

      Prep,  Subj, Verb, Obj  <::   Subj,  Verb,  Obj,  Prep . 

/* a context-sensitive rule */

18              Syntax: Extensions of transduction grammars

18.1 Introduction

In section 7.1.2 Transduction Grammars were introduced. In section 7.1.3 we discussed the applicability for XML up-conversion, e.g. when floating elements in the source documents have to be converted into mixed content. Example 5 will clarify this. We start with a short description of Transduction Grammars. More information is available in the Reference Manual.

 

Transduction grammars differ in many aspects from grammars used for recognition and/or parsing, while the syntax is nearly the same. Transduction grammars contain transduction rules which are essentially context-sensitive rules. However, transduction grammars do not have a first or starting symbol. A transduction rule can be interpreted as "the right-hand side (the input side), when recognized, will be replaced by the left-hand side (the output side)".

The left-hand side can be empty. In that case a pattern in the input matched by the right-hand side will be deleted. The right-hand side must contain at least one symbol (in a recognition grammar the right-hand side may be empty).

 

In a transduction grammar there is no need to describe the whole input as is the case in a grammar used for recognition and parsing. Only parts of the input may be covered by the rules. Input not described by any transduction rule will be simply emitted to the output.

It is important to be aware of the fact that during run time, as soon as a transduction rule reduces, the input covered by the right-hand side will be replaced by the symbols mentioned in the left-hand side. The run time system resumes scanning the input with the first symbol of the newly inserted ones, thus treating the replaced part as new input.

 

Examples:

 

      /*1*/ 'a'         <::   'a', 'a' .  /* each "aa" in the input will

be replaced by "a" */

      /*2*/ 'b', 'a'    <::   'a', 'b' .  /* each "ab" will be replaced by "ba" */

      /*3*/             <::   'd', 'd' .  /* each "dd" will be omitted */

 

A grammar consisting of only rule 1 will transform a large number of "a" 's into only one "a". The first "aa" will be replaced by one "a", which together with the next "a" forms a new pair, which again will be replaced by one "a", until finally the last "a" remains untouched. By the same mechanism, a grammar consisting of only rule 2 rewrites an input "aaabbb" to the output "bbbaaa" and an input "cabc" to the output "cbac". Finally, a grammar consisting of only rule 3 transforms an input "cddcd" into "ccd".
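For a single rule, the rewrite-and-rescan behaviour can be imitated by repeatedly replacing the leftmost match. This is an approximation in Python; the real run time system resumes scanning at the newly inserted symbols, but for these one-rule grammars the results coincide.

```python
def transduce(text, rhs, lhs):
    # replace the leftmost occurrence of the input side (rhs) by the
    # output side (lhs) and rescan, until no match remains
    while rhs in text:
        text = text.replace(rhs, lhs, 1)
    return text

print(transduce("aaaa", "aa", "a"))      # rule 1 -> a
print(transduce("aaabbb", "ab", "ba"))   # rule 2 -> bbbaaa
print(transduce("cddcd", "dd", ""))      # rule 3 -> ccd
```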

 

For a transduction grammar the compiler switch TRANSDUCE has to be set to TRUE. This has to be done in the “Advanced mode” of the gui. (The setting will be remembered when you return to the “Basic Mode”.)

 

18.1.1    Cover symbols

There are situations in which you want a non-terminal on the left-hand side to stand for the input it has matched.  This may be the case when you want to specify extra context in a rhs by one or more nonterminals, where the nonterminals may be described by context-free rules.

Using the cover sign (the caret, "^"), you can mark a non-terminal on the left-hand side of a context sensitive rule as being a cover non-terminal. Upon recognition of the rule, the run time system will replace cover non-terminals with the terminal symbols they have covered on the right-hand side.

(Note that it is an error to use a cover symbol that occurs twice or more at the right-hand side. On the other hand, it is allowed to use the same cover symbol more than once on the left-hand side.)

In run time, cover non-terminals behave as if they were substituted by the sequence of terminals covered by the same symbol on the right-hand side.

Example 1:

'pq', 1_Ending^   <::   'a',  1_Ending.

'r',  2_Ending^   <::   'a',  2_Ending .

1_Ending          ::=   'bc' | 'de' .

2_Ending          ::=   'kl' .

The input ‘ade’ will be rewritten as ‘pqde’, the input ‘akl’ as ‘rkl’.

 

Example 2:

H1^   <:: '(' H1 ')' .

H1    ::= ([0-9])+.

The transduction of ‘a(325)bc’ will be ‘a325bc’.
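Example 2 above has a convenient regular-expression analogue: the cover symbol H1^ corresponds to a capture group that re-emits exactly the digits it matched. The Python sketch below is only an analogy, not how Transform is implemented.

```python
import re

def uncover_parens(text):
    # H1^ <:: '(' H1 ')'   with   H1 ::= ([0-9])+
    # the capture group \1 plays the role of the cover symbol H1^
    return re.sub(r"\((\d+)\)", r"\1", text)

print(uncover_parens("a(325)bc"))   # -> a325bc
```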

 

18.1.2    Intermediate symbols

Intermediate symbols are used in grammars with context sensitive rules or in a context free grammar that is the last one of a cascaded grammar (see below). These symbols are called "intermediate" because they are not terminal and not non-terminal, but something in between. They differ from terminal symbols in that they do not occur in the original input: they are introduced in the left-hand side of a rule. They differ from non-terminal symbols in that they do not occur on the left-hand side of a defining (context free) rule in the grammar.

Intermediate symbols are mainly of interest for linguistic grammars on the grapheme/phoneme level. Further reading.

 

18.1.3    Ambiguity and "looping" problems in transduction grammars

 

Special attention must be paid to ambiguous transduction. This occurs if two (or more) context sensitive rules describe (part of) the same input. For example:

 

      /*1*/ 'a', 'c', 'b'     <::   'a', 'b' .

      /*2*/ 'd', 'b'          <::   'b' .

 

If (part of) the input is "ab" then as soon as the run time system reads the "b", both rules are recognized. Rule 1 rewrites this input to "acb" and rule 2 to "adb" and the grammar creates ambiguous output. By default the switch ONE_CS_REDUCE will be set to TRUE for transduction grammars. The effect is:

 

-  if two or more context sensitive rules are applicable at the same moment of reduction, the run time system chooses the one with the longest right-hand side (the number of terminals that have been covered by the rule) and

-  if two rules have the same length on the right-hand side, the one that is mentioned first in the grammar will be chosen.

 

If ambiguous output is really wanted, the compiler switch ONE_CS_REDUCE=FALSE can be added to the switches file. In either case, the compiler generates a warning if ambiguous rewritings are to be expected.
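The effect of ONE_CS_REDUCE can be stated compactly: among the applicable rules, prefer the longest right-hand side, with ties broken by grammar order. A minimal sketch (the data structures are assumed for illustration):

```python
# each rule as (right-hand side, left-hand side), in grammar order
rules = [("ab", "acb"),   # rule 1: 'a', 'c', 'b' <:: 'a', 'b'
         ("b",  "db")]    # rule 2: 'd', 'b'      <:: 'b'

def choose_rule(applicable):
    # longest rhs wins; max() is stable, so on equal length the rule
    # mentioned first in the grammar is chosen
    return max(applicable, key=lambda rule: len(rule[0]))

print(choose_rule(rules))   # -> ('ab', 'acb'), i.e. rule 1
```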

 

In designing a transduction grammar, care must be taken not to write transduction rules that will bring the run time system into an endless loop. For example, the following innocent looking rule will cause such a looping problem:

      'a',  'a'   <::   'a' .

The input of only one "a" will be replaced by two "a" 's, each of which in turn will be replaced by two "a" 's, etc., until the program crashes due to a shortage of memory.
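A looping rule of this kind contains its own right-hand side inside its left-hand side, so every replacement reintroduces a match. In a simulation one can guard against this with a step limit (a sketch, not a Transform feature):

```python
def transduce_bounded(text, rhs, lhs, max_steps=1000):
    # repeated leftmost replacement, but give up after max_steps
    for _ in range(max_steps):
        if rhs not in text:
            return text
        text = text.replace(rhs, lhs, 1)
    raise RuntimeError("transduction did not terminate (looping rule?)")

print(transduce_bounded("aaaa", "aa", "a"))   # terminates: -> a
# transduce_bounded("a", "a", "aa") would raise: 'a', 'a' <:: 'a' loops
```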

 

18.1.4    Negation

In transduction grammars the negation of a single character can be extended to the negation of a string, of a regular expression and of a nonterminal. More reading on negations.

18.2 More reading on transduction grammars

19              Example 5: Floating elements

19.1 Introduction

Some document schemas are rather liberal in describing the position of elements, as if they are floating within mixed content. In that case pattern matching becomes a viable option in order to locate the start and end of the floating elements.

This example demonstrates preprocessing by a transduction grammar of the source document of example 4, in order to locate the floating elements Note, Warning, Mat, RT and LI.

In the DTD of ex. 4 the elements Mat and RT were located within mixed content. We assumed that the elements Mat and RT always start with a “(“ and that the PCDATA before Mat and RT does not contain “(“’s. This assumption avoids the ambiguity that arises when a “(“ may be part of PCDATA and at the same time initiate the start of Mat and RT. Note and Warning had a rather fixed position. LI elements always started on a new line.

In this example we assume that the PCDATA before Mat and RT may contain “(“’s and that the positions of Note, Warning and LI are flexible.

The target document of Ex_5 will have the extension “tdo”.

The preprocessing concerns the following:

-  to find the start of the elements Note, Warning and LI; their position will be marked in Ex_5.tdo by the character #x80 (printed as a “€”) which does not occur in the source document (any such character may be taken). This marker will do the disambiguation, instead of the “(“. In order to find the start of the elements all relevant context may be specified in the transduction rules.

-  to do an immediate markup of the elements Mat and RT.

The next grammar will use the special marker in order to locate the positions for the starttags and the endtags of Note, Warning and LI. The markup of Mat and RT will remain untouched, as being part of a stream of #PCDATA.

 

This example (and the next one) serves to familiarize oneself with transduction grammars and to get some feeling of what to delegate to a possible preprocessing phase and what to the recognition phase.

In general, preprocessing is only required for the precise handling of floating elements, where ambiguities in handling PCDATA have to be avoided.

19.2 Source document (Ex_5.txt, same as Ex_4.txt)

19.3 Grammar (Ex_5_td.grm)

 

'NOT' #x80 'E' <:: 'NOTE'.

 

'WARNIN' #x80 'G' <:: 'WARNING'.

 

'(' H1^ #x80 ')' <::   

      '('

      H1

      ')' .

 

'<MAT MREF="', Matno^, '"/>' <::

       '('

       ('M' | 'm')

       'aterial No. '

       Matno

       ')' .

 

Matno ::=

      'Fk'

      nums

      '-'

      nums .

 

'<RT RID="'

      TID^

      '" PAGE="'

      H1^

      '"/>'

<::

      '(refer to TASK '

      TID

      ' / page book '

      H1

      ')' .

 

TID ::=

      't'

      nums

      ( '-'

        nums

      )*.

 

H1 ::=

      nums .

 

num ::=

      [#x30-#x39].

 

nums ::=

      ( num )+.

 

H2^

       #x80

      ' '

<::

      H2

      separators.

 

H2 ::=

      CR

      ( separators )*

      '-'.

 

separator ::=

      ' '

      | #x09. /* space or TAB */

 

separators ::=

      ( separator )+ .

 

19.4 Target document (Ex_5.tdo)

 

TASK t52-43-01-000-800

Remove Hatch Seals

 

SUBTASK  s52-43-01-010-001-001

A.    Get access

      (1€)  Open the hatches and remove them <RT RID="t52-43-00-000-800" PAGE="401"/>.

      (2€)  Put the hatches down on a soft mat.

 

SUBTASK s52-43-01-020-001-001

B     Remove the hatch seals.

      NOT€E On each hatch there are 42 screws.

      (1€)  Remove the screws.

      (2€)  Remove the retainer strips.

 

SUBTASK  s52-43-01-350-001

C.    Remove corrosion

      (1€)  Remove any corrosion with abrasive cloth <MAT MREF="Fk05-024"/>.

 

SUBTASK  s52-43-01-110-001

D.    Clean the hatch

      WARNIN€G BE CAREFUL WHEN YOU USE THE

      METHYLETHYLKETONE (MEK). IF YOU GET MEK IN YOUR EYES:

      -€ FLUSH YOUR EYES WITH WATER

      -€ GET MEDICAL HELP.

      THE MEK IS POISONOUS.

      (1€)  Clean the areas with MEK <MAT MREF="Fk11-003"/>.

20              Example 6: Two-stage conversion

Use directory Ex_6. Copy grammar Ex_4.grm into Ex_6.grm. Adjust Ex_6.grm in such a way that Ex_5.tdo will be the source document. The target document Ex_5.xml has to be identical to Ex_4.xml.

Several solutions are possible; one of them is contained in Ex_6_possible_solution.grm.

21              Syntax: Cascaded grammars

21.1 Introduction

It is possible to split a grammar into two (or more) grammars, and to link these grammars together into a so-called cascaded grammar. The run time system treats the input with the first grammar and uses the resulting output as input for the second grammar, and so on. The linking can be performed by the program Plink. All grammars have to be transduction grammars, except the last one: the last grammar of a cascaded grammar can be a transducer or a recognizer/parser. In the latter case, the recognizer may perform a check on the rewriting of the former grammars. Of course, such a configuration does not result in a file with the rewritten input (the file with extension .TDO) but in the output requested by the last grammar: rejection/acceptance of the transformed input in the case of a recognizer, a parse tree in the case of a parser, and any output written by output instructions.

A cascaded grammar can consist of a maximum of 25 individually compiled grammars.

            Remark: The runtime system creates “pipes” between cascaded grammars. That means that as soon as an output symbol becomes available in one cascaded grammar, it is transferred as input to the next one. In that manner the on-line property of the runtime system is maintained.

21.2 Example

Suppose each symbol ‘a’ in the input has to be replaced by ‘aa’. As we have seen earlier, the following innocent looking rule will cause a looping problem:

      'a',  'a'   <::   'a' .

The input of only one "a" will be replaced by two "a"s, each of which in turn will be replaced by two "a"s, and so on, until the program crashes due to a shortage of memory.

The problem can be solved in the following way:

Grammar1:

            'a', treated, 'a', treated,  End       <::         'a', End .

            End                                          ::=         'a' | cr .

Grammar2:

            <::         treated .            /* This transduction rule has an empty left-hand side */

After linking to a cascaded grammar, the run time system will transform the input without the superfluous intermediate symbol "treated".
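The pipe behaviour of a cascaded grammar can be mimicked with Python generators (a sketch of the principle, not of the actual runtime): each stage yields output symbols as soon as they are available, and the next stage consumes them immediately, so no stage waits for the previous one to finish.

```python
# Sketch of cascading via "pipes": each stage is a generator that
# emits symbols as soon as they are ready, preserving the on-line
# property. The lookahead on End in Grammar1 is omitted here.
def stage1(chars):
    for c in chars:
        if c == "a":
            # rewrite 'a' -> 'a', treated, 'a', treated
            yield from ["a", "treated", "a", "treated"]
        else:
            yield c

def stage2(symbols):
    for s in symbols:
        if s != "treated":   # empty right-hand side: delete it
            yield s

print("".join(stage2(stage1("ab"))))  # aab
```

Because stage1 never sees its own output, the looping problem of the single rule cannot occur.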

 

22              Example 7: Cascaded grammars

22.1 Introduction

In example 6 there are two grammars which have to be run one after the other. In this example 7 we organize both grammars as a cascaded grammar.

Directory Ex_7 contains the compiled transduction grammar Ex_5_td.pg and the compiled recognition grammar Ex_6_possible_solution.pg.

22.2 Gui: Cascade linker

 

 

 

 

 

 

 

22.3 Instructions

-         Make sure that in the Basic Mode of the GUI, “Strip first report number” is selected at “Report file transformation to XML”.

-         Press “Advanced mode…”

-         Press the tab “Cascade linker”.

-         In the box “Select Grammar” select Ex_5_td.pg

-         Press “Add to List”

-         In the box “Select Grammar” select Ex_6_possible_solution.pg

-         Press “Add to List”

-         If you want to change the position of a compiled grammar in the sequence, first select it and then use the buttons “Move Up” and/or “Move Down”. If necessary, you may remove a grammar with the button “Remove”.

-         In the box “Output to grammar” select the name of an already existing compiled grammar or type a new name, in this case Ex_7.pg.

-         Press the button “Link grammars”. The Process Monitor will show up. If the exit status is 0 then everything went well. Else you have to study the error messages.

-         Close the Process Monitor.

-         The current combination of filenames is automatically saved in the file “Ex_7.sam”. Later you may load this file with the button “Load sam-file…”. All filenames will then be restored.

-         Press the tab “Parser”

-         Press “Select inputfile…” and select the input file “Ex_4.txt”.

-         Press “Run grammar…”. The Process Monitor will show up. If the exit status is 0 then everything went well. Else you have to study the error messages.

-         Close the Process Monitor.

-         Inspect the output file Ex_4.xml.

-         You are done.

 

23              Syntax: lexicons

Lexicons are mainly used in linguistic grammars (see example 9). However, they can also be used for more mundane tasks like the enumeration of members of a closed set. That is the topic of the following example 8.

Please read first the section in the Reference Manual on lexicons.

When lexicon symbols are used in a grammar, the run time system must use a lexicon. The construction of a lexicon is described in example 8a below.

 

In the dtd of our running example (ex4.dtd) there are two kinds of identifiers, TID for “task identification” and STID for “subtask identification”. They are defined as CDATA, which is very general. Suppose we want to do a rigorous check on the values of the identifiers. We could enumerate them in the grammar ex4.grm, but this will be impractical in the case of a large set of values. Moreover, every time the set changes we would have to recompile the grammar.

For the sake of demonstration we will construct a lexicon with all the allowed values of TID and STID (example 8a) and will adjust the grammar in order to use that lexicon (example 8b).

24              Example 8a: the construction of a lexicon

24.1 Maintenance of a lexicon

In LingDoc Transform a lexicon is maintained as a text file. It is compiled into a compiled lexicon that can be used by the runtime system. The compiled lexicon uses the data structure of a trie on external memory in order to support the on-line property of the runtime system. The lexicon compiler may also add new entries to the compiled lexicon.
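The trie idea behind the compiled lexicon can be sketched in Python (an in-memory toy, not the external-memory structure Transform uses): each character advances one node, so lookup can proceed on-line, character by character.

```python
# Toy in-memory trie illustrating the lookup idea; the real compiled
# lexicon lives on external memory.
def trie_insert(root, word, attr):
    node = root
    for ch in word:
        node = node.setdefault(ch, {})
    node["$attr"] = attr   # mark end of entry, store its attribute

def trie_lookup(root, word):
    node = root
    for ch in word:
        if ch not in node:
            return None
        node = node[ch]
    return node.get("$attr")

lexicon = {}
trie_insert(lexicon, "52-43-01-000-800", "t")
trie_insert(lexicon, "05-024", "m")
print(trie_lookup(lexicon, "05-024"))   # m
print(trie_lookup(lexicon, "05-025"))   # None
```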

This mechanism for maintenance is rather crude. A more sophisticated tool for lexicon maintenance is LexBench, part of LingDoc Clarity. It uses a database for the storage of the lexicon. If you want to use one of the lexicons of Clarity in Transform, then you may export the contents of that lexicon as a text file in order to compile it as a lexicon for Transform.

At this moment the syntax of a lexicon in Transform is slightly different from the syntax of an exported lexicon of LexBench. In the near future they will become the same. In the meantime you may write a small conversion script of your own or obtain one from Palstar (of course written as a Transform grammar).

24.2 Lexicon for the running example

For our running example we define the grammar symbol $number_lex as lexicon symbol for the values of the attributes ID of element TID, ID of element STID and MREF of element MAT. For the sake of demonstration we list in the lexicon all values of the attributes that appear in the document, together with an attribute which indicates the type of the value (also for the sake of demonstration). The lexicon is contained in file Ex_8\lex\ex_8_lex.txt:

 

$number_lex('52-43-01-000-800','t')

$number_lex('52-43-00-000-800','t')

$number_lex('52-43-01-010-001','s')

$number_lex('52-43-01-020-001','s')

$number_lex('52-43-01-110-001','s')

$number_lex('52-43-01-350-001','s')

$number_lex('05-024','m')

$number_lex('11-003','m')
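The textual entries have a simple, regular shape. A throwaway Python parse of one such line (assuming exactly this quoting, one entry per line) could look like:

```python
import re

# Parse one textual lexicon line of the form
#   $number_lex('52-43-01-000-800','t')
# into (symbol, value, attribute). Assumes exactly this quoting.
LINE = re.compile(r"^\$(\w+)\('([^']*)','([^']*)'\)$")

def parse_entry(line):
    m = LINE.match(line.strip())
    if m is None:
        raise ValueError("not a lexicon entry: " + line)
    return m.groups()

print(parse_entry("$number_lex('05-024','m')"))
# ('number_lex', '05-024', 'm')
```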

 

24.3 Gui: compilation of a lexicon

 

 

 

 

 

 

 

24.4 Instructions for compiling a lexicon

-         In the GUI press “Advanced mode…”.

-         Press the tab “Lexicon compiler”.

-         For “lexicon source” select Ex_8\lex\ex_8_lex.txt.

-         For “compiled lexicon file” select Ex_8\lex\ex_8_lex.ltr.

-         Note: if this is an existing compiled lexicon (with extension “.ltr”) then the entries in ex_8_lex.txt will be added to it. If you want to start with an empty compiled lexicon, then you have to delete the existing compiled lexicon beforehand. If the compiled lexicon does not exist, you will have to navigate to the correct directory (here: Ex_8\lex) and type in the name yourself, with or without the extension “.ltr”.

-         Press the button “Compile lexicon…”. The Process Monitor will show up. If the exit status is 0 then everything went well. Else you have to study the error messages.

-         Close the Process Monitor.

-         You are done.

25              Example 8b: conversion with a lexicon

25.1 Source document

The source document is Ex_8\doc\ex4.txt (which is the unmodified source document of Ex_4).

25.2 Grammar

The grammar is Ex_8\gram\Ex_8.grm, which is Ex_4.grm, partly modified in order to accommodate the lexicon symbols. The changes are:

 

--old:

TID(O:ID) ::=

      't' {ID = %} number(I:N) {ID = ID||N}

      ( '-' {ID = ID||%}

        number(I:N) {ID = ID||N}

      )*.

 

--new:

TID(O:ID) ::=

      't' {ID = %}

      Numbers_with_hyphens(I:st, I:ID1)

      {ID == st; ID = ID || ID1}.

 

--old:

RT    ::=

      '(refer to TASK '

      't'

      Numbers_with_hyphens(I:rid)

      ' / page book '

      number(I:page)

      ')'

      {R:1:'<RT RID="t' || rid ||'" PAGE=" ' || page || '"/>'} .

 

--new:

RT    ::=

      '(refer to TASK '

      't'

      Numbers_with_hyphens(I:st, I:rid)

      {st == 't'}

      ' / page book '

      number(I:page)

            ')'

      {R:1:'<RT RID="t' || rid ||'" PAGE=" ' || page || '"/>'} .

 

--old:

Numbers_with_hyphens(O:ID) ::=

      number(I:N) {ID=ID||N}

      ( '-' {ID=ID||%}

        number(I:N) {ID=ID||N}

      )*.

 

--new:

Numbers_with_hyphens(O:st, O:ID) ::=

      $number_lex(ID, st).
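The effect of the change can be paraphrased in Python (names here are illustrative): the old rule accepted any digits-and-hyphens shape, while the new rule additionally requires the value to be present in the lexicon with the right type attribute.

```python
import re

# The old rule: any sequence of digit groups joined by hyphens.
SHAPE = re.compile(r"^\d+(-\d+)*$")

# The new rule: the value must also be a lexicon entry of the
# requested type ('t' = task, 's' = subtask, 'm' = material).
LEXICON = {
    "52-43-01-000-800": "t",
    "52-43-01-010-001": "s",
    "05-024": "m",
}

def accept_old(value):
    return SHAPE.match(value) is not None

def accept_new(value, wanted_type):
    return LEXICON.get(value) == wanted_type

print(accept_old("99-99"))                  # True: the shape is enough
print(accept_new("99-99", "t"))             # False: not in the lexicon
print(accept_new("52-43-01-000-800", "t"))  # True
```

This is why the lexicon-based grammar gives a rigorous check without enumerating the values in the grammar itself.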

25.3 Instructions for using the lexicon

-         In the GUI, in the Basic Mode, select as grammar Ex_8\gram\Ex_8.grm.

-         Compile the grammar.

-         Press the button “Advanced Mode”, press the tab Parser.

-         Select as inputfile Ex_8\doc\Ex_4.txt.

-         Select (at the left side of the pane) as lexicon: Ex_8\lex\ex_8_lex.ltr.

-         Select (in the middle of the pane) the checkbox “lex_increment”.

-         Press the button “Run grammar…”. The Process Monitor will show up. If the exit status is 0 then everything went well. Else you have to study the error messages.

-         Close the Process Monitor.

-         You are done. The output is in Ex_8\doc\Ex_4.xml.

26              Example 9: Natural language parsing

26.1 Introduction

The position paper “4. LingDoc and Grammar Formalisms” gives an overview of the capabilities of Transform for natural language parsing. This example gives a demonstration, with the natural language sentences in our test document as input. The grammar is kept as short as possible. The words in the sentences are contained in a lexicon.

26.2 The input

In computational linguistics it is customary to study a natural language sentence on its own, stripped of markup and material that is only meant for presentation. A grammar covers one sentence. This is a linguistic approach. The handling of characters for structure and presentation is delegated to the lexical scanner.

In the exercises up till now, a grammar in Transform covered a complete document in which natural language is embedded. This is a document oriented approach. In an XML document the natural language is contained in the PCDATA and is not further marked up.

In Transform, a number of switches have been introduced in order to accommodate either a pure document oriented approach, a pure linguistic approach, or a combination of both.

We have met the document oriented case up till now. The linguistically oriented case is dealt with in this example 9. The combination of both will be met in example 10.

The input in this example (in file doc\ex9.txt) has been constructed by the former grammars. We will describe and parse the sentences which are contained in the PCDATA.

26.3 Switches used

In this example the grammar covers one sentence. Blanks (spaces and tabs) between words are ignored and do not have to be described by the grammar.

The following switches are of interest. Check if they are set in the tab Parser. The GUI writes the values of the switches to gram\ex_9.sw.

single_lines=true                    -- each natural language sentence is expected to be on a single line, delimited by an end-of-line character. The grammar concerns one complete sentence. The parser will process all sentences in the source document.

ignore_blanks=true                -- blanks (spaces and tabs) are ignored.

build_parsetree=true              -- parsetrees are constructed, in file ex_9.out.

lex_increment=false               -- the scanner reads a complete word before it is looked up in the lexicon.

lex_chars=a..z | A..Z | 0..9     -- these characters may appear in the lexicon.
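Together these switches amount to a scanner along the lines of the following Python sketch (illustrative only): sentences are taken line by line, blanks are skipped, and a complete word is read before it would be looked up in the lexicon.

```python
import re

# Sketch of the scanner behaviour implied by the switches:
# single_lines: one sentence per line; ignore_blanks: spaces and
# tabs are dropped; lex_increment=false: a complete run of
# lex_chars is read before lexicon lookup.
LEX_CHARS = re.compile(r"[A-Za-z0-9]+|[^A-Za-z0-9 \t]")

def scan(sentence):
    return LEX_CHARS.findall(sentence)

print(scan("Get access."))  # ['Get', 'access', '.']
print(scan("On each hatch there are 42 screws."))
```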

26.4 Source document (ex_9.txt)

Remove Hatch Seals.

Get access.

Open the hatches and remove them.

Put the hatches down on a soft mat.

Remove the hatch seals.

On each hatch there are 42 screws.

Remove the screws.

Remove the retainer strips.

Remove corrosion.

Remove any corrosion with abrasive cloth.

Clean the hatch.

Be careful when you use the methylethylketone.

If you get mek in your eyes.

Flush your eyes with water.

Get medical help.

The mek is poisonous.

Clean the areas with mek.

26.5 Grammar (ex_9.grm)

S ::=

            SENTENCE.

SENTENCE ::=

            IMP_S

            (ADV_SUB_F)?

            EOS

            |

            ADV_SUB_F

            (PP_S)?

            EOS

            |

            SVO

            EOS.

IMP_S ::=

            $v_np()

            NP(dummy)

            ($adv_a())?

            ($part())?

            (PP_S)?

            |

            $v_copi()

            $adjpre().

SVO ::=

            NP(dummy)

            $v_np()

            $adjpre()

            |

            (PP_S)?

            $there()

            $v_copi()

            NP(dummy)

            |

            $pers()

            $v_np()

            NP(dummy).

PP_S ::=

            $prep_s()

            ($v_np())?

            NP(dummy).

NP(O:num) ::=

            (DETP)?

            NOUN(num)

            |

            QUAN(num)

            NOUN(num1) {num==num1}.

DETP ::=   

            $artdef() | $art_i() | $predet().

NOUN(O:num) ::=

            $noun(dummy1,num1)

            $noun(dummy2,num)

            |

            $dem(dummy,num)

            |

            ($adjpre())?

            $noun(dummy,num).

QUAN(O:num) ::=

            $quan(dummy,num)

            |

            $card(dummy,num).

ADV_SUB_F ::=

            $sconj()

            SVO.

EOS ::=     ('.' | ':' | ';').
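The agreement check in the NP rule ({num==num1}) can be mirrored in Python as a toy illustration (not the parser itself): the cardinal and the noun each carry a number feature from the lexicon, and the alternative succeeds only when the features agree.

```python
# Toy illustration of the number-agreement check in the NP rule:
#   QUAN(num) NOUN(num1) {num==num1}.
# 's' = singular, 'p' = plural, taken from the lexicon entries.
NOUNS = {"screws": "p", "hatch": "s"}
CARDS = {"42": "p"}

def np_agrees(card, noun):
    num = CARDS.get(card)
    num1 = NOUNS.get(noun)
    return num is not None and num == num1

print(np_agrees("42", "screws"))  # True  (both plural)
print(np_agrees("42", "hatch"))   # False (plural vs singular)
```

So "42 screws" parses as an NP, while "42 hatch" is rejected by the constraint.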

26.6 Lexicon (ex_9_lex.txt)


$adjpre('abrasive')

$adjpre('careful')

$adjpre('clean')

$adjpre('medical')

$adjpre('necessary')

$adjpre('open')

$adjpre('poisonous')

$adjpre('soft')

$adjpre('your')

$adv_a('if necessary')

$artdef('the')

$art_i('a')

$card('42','p')

$dem('them','p')

$noun('access','s')

$noun('areas','p')

$noun('cloth','s')

$noun('corrosion','s')

$noun('eyes','p')

$noun('hatch','s')

$noun('hatches','p')