LingDoc Transform
A tool for document and language engineering
User Manual
Version 3.2
Palstar, Uffelte, The Netherlands
www.lingdoc.eu
www.palstar.nl
Contents

2  Working Procedure
3  Install Transform
4  Directory structure
5  Explore the Interface
7  Syntax: EBNF grammars and Regular Expressions
7.1  The formalism of Transform in a technical nutshell
7.1.3  When to use recognition and when transduction grammars
8  A running example
9  Example 1: getting started
9.1  Source document (Ex_1.txt)
10  Syntax: extensions for output, variables and expressions
11.1.2  Messages with variables
12  Example 2: Output
12.1  Source document (Ex_2.txt, same as Ex_1.txt)
12.2  DTD (Ex_2.dtd, same as Ex_1.dtd)
12.3  Grammar (Ex_2.grm, as an extension to Ex_1.grm)
12.4  Target document (Ex_2.xml)
13  Syntax: Variables and expressions
13.2  Variables, literals and expressions
13.4.3  Combinations of tests and/or assignments
14  Example 3: Variables
14.1  Source document (Ex_3.txt)
14.4  Target document (Ex_3.xml)
15  Methodology for up-conversion: from DTD to grammar
16  Example 4: The complete conversion
17  Syntax: Extensions of context-sensitive grammars
18  Syntax: Extensions of transduction grammars
18.1.3  Ambiguity and "looping" problems in transduction grammars
18.2  More reading on transduction grammars
19  Example 5: Floating elements
19.2  Source document (Ex_5.txt, same as Ex_4.txt)
19.4  Target document (Ex_5.tdo)
20  Example 6: Two-stage conversion
21  Syntax: Cascaded grammars
22  Example 7: Cascaded grammars
24  Example 8a: the construction of a lexicon
24.2  Lexicon for the running example
24.3  Gui: compilation of a lexicon
24.4  Instructions for compiling a lexicon
25  Example 8b: conversion with a lexicon
25.3  Instructions for using the lexicon
26  Example 9: Natural language parsing
26.4  Source document (ex_9.txt)
26.7  Instructions for running the example
27  Example 10: Structure sensitive Natural Language parsing
27.3  Source document (ex_10.txt)
27.5  Lexicon (same as ex_9_lex.txt)
28  Syntax: Ambiguity and non-determinism
29  Example 11: Ambiguity
29.2  Source document (ex_11.txt)
30.1  Run Time Input and Output files
30.4  Drawing (ambiguous) parsetrees
30.5  Drawing (ambiguous) transduction output
LingDoc Transform (in short: Transform) is a general-purpose system for the analysis of natural language texts and for the conversion of documents. It is introduced in the “Management Summary of LingDoc Transform” and in a number of position papers.
This User Manual explains the working of Transform as concisely as possible and provides for immediate hands-on experience. It explains the syntax and shows a number of examples in the form of exercises. You may modify the exercises and try them out on your computer. There is one running example: the analysis and conversion of an aerospace manual into XML, together with the linguistic analysis of its text. The running example gives a comprehensive overview of the capabilities for up-conversion to XML and for language engineering tasks. Along the way references are made, by hyperlinks, to the accompanying Reference Manual. That manual gives a rather exhaustive description of the complete system at its own pace.
This User Manual is aimed at the more experienced user who already has a basic understanding of the syntax of XML DTDs, EBNF and regular expressions. We advise the impatient reader, experienced or not, to follow this User Manual and to use the hyperlinks in case something is not clear or not precise enough. We welcome any suggestions for further improvement of these manuals.
The Transform specify-parse-adjust-convert cycle enables you to easily specify the structure of a document, parse it, adjust it and then convert it to valid XML. Here’s how it works:
1. SPECIFY GRAMMAR FOR DOCUMENTS
2. PARSE DOCUMENTS
3. ADJUST GRAMMAR AND DOCUMENTS
4. CONVERT DOCUMENTS TO XML
Obtain a number of source documents, together with their target documents in XML and the document schema. Convert the schema, in a one-to-one correspondence, into a grammar in the W3C notation. The end symbols of the grammar are the characters in the source documents. Add regular expressions to fine-tune the grammar to the peculiarities of the source documents.
Compile the grammar into a parser. Parse the documents and let errors be reported.
Errors may be caused by flaws in the grammar and/or by errors and inconsistencies in the documents. Edit the grammar and/or the documents. The cycle of parse-adjust may be repeated until no errors are reported.
Add output specifications to the grammar. The documents will then be converted into XML. Alternatively, whichever is easier, specify in Transform a second type of grammar which contains rewriting patterns. Or use your own rewriting program. In the latter case you use Transform mainly as a tool for getting documents correct, using the cycle 1-2-3.
For the installation of Transform from the CD: follow the instructions in the file Install.txt.
Two directories will be available: Transform and Java. The directory Java will only be used by the Gui. The directory structure of Transform is as follows:
BIN: contains the executables for the system Transform
ETC: contains a text file for error messages, to be used by the system
Examples_more: contains additional examples, referred to by the User Manual
Examples_myself: the directory you will work with
Examples_User_Manual: initially the same as Examples_myself; in case you mess up your work you can always copy the original
GRM: contains the bootstrap grammars for the system (the formalisms used within the system are described by Transform grammars themselves)
Gui: contains the Java classes for the graphical user interface
Manuals: contains the User Manual and the Reference Manual
several .bat files: for running the system from the Gui or in batch
click-for-Transform: a Windows pif file; click on it to start the Gui; you may copy this pif file to the desktop for easier use
read.me: instructions for adjusting path names; initially there is no need to do this
Start the Gui of Transform by clicking on “click-for-Transform”. This brings you to the main menu, also called “Basic Mode”. The menu contains comments on each of the six buttons. The buttons are further explained hereafter.
The behavior of the compiler and the run time system can be adapted to the wishes of the user: a number of options modify the behavior of the programs to some extent. The options are set by switches in the GUI. Nearly all switches may be specified and altered using the Advanced mode.
The switches are kept in a so-called switches file, one for the compiler and one for the run time system. The switches files have the extension .SW. Each switch has a default value; if a switch value is not specified by the user, this default value is assumed.
A switches file is maintained automatically by the GUI, but can also be altered by hand.
Some settings of switches are related. The GUI takes care of the automatic change of the setting of a switch in case the setting of a related switch is altered.
Clicking in the Main menu on “Advanced mode…” will bring you to the menu “Advanced Mode”, where you can select five tab pages. In general, you can set a number of switches for the compiler and the parser, disassemble a compiled grammar, compile a lexicon and link several grammars together. Moving the cursor over one of the switches will reveal a short explanation.
For each exercise, the values of the switches are contained in the switches file, with extension .sw.
However, for doing the exercises in this User Manual it is not necessary to use the Advanced Mode.
The Advanced Mode is explained in more detail in the Reference Manual.
You are encouraged to experiment with the examples in this User Manual. In doing so you will operate on text files. You may use your favorite text editor (a simple editor like WordPad will be sufficient).
In Windows, use “mapoptions” to associate the extensions of the relevant input and output files of Transform to your editor. The extensions are:
- dtd (this type could also be associated with an XML-editor)
- grm (the grammar file)
- tdo (the target document created by the transducer)
- txt (the source document)
- xml (the target document) (this type could also be associated with a browser like Internet Explorer)
A few more types of files are created by the compiler and the run time system. However, you do not have to inspect or alter them. (If you are interested: read more on files related to compilation and files related to the runtime system.)
The operation of the Transform system is described here. A description is given on types of files, error handling, batch operation and screen output.
The basic syntax of Transform is that of EBNF grammars, together with Regular Expressions for terminals. The symbols for EBNF are the same as for content models in XML DTD’s.
A grammar consists of a set of rewrite rules, which have a left hand side (“lhs”) and a right hand side (“rhs”). The set is unordered.
The basic syntax is extended in a number of ways:
- everywhere in a rhs so-called “Actions” may be specified, which contain small programs; the programs use variables whose scope is the grammar rule. With the aid of variables one may construct the target document.
- nonterminals may have parameters, which may be the variables used in the Actions. Parameters are used to distribute components of the source document, e.g. for the construction of XML attributes or for re-ordering purposes.
If the lhs of a rule contains one symbol, the rule is context-free; if it contains more than one symbol, the rule is of Chomsky type 1 or type 0 (type 0 if the lhs is longer than the rhs).
Type-0 rules may be written as transduction rules. In that case a rhs becomes a pattern that, if located in the source document, will be rewritten according to the lhs.
If the compiler switch Transduce is set to true then a grammar will be treated as a transduction grammar.
Transduction rules behave completely differently from recognition rules, although the syntax is nearly the same. All transduction rules run in parallel. The output of a rule is again subject to the matching of all rules.
Recognition rules describe precisely the sequence of characters within the source document. For each notion in the grammar its context is automatically determined by the whole grammar. The parsing of a source document with a recognition grammar may fail or succeed.
By contrast, the transduction of a source document always succeeds. This is the way XSLT and tools based on pattern matching operate.
With transduction rules you may delete, insert, replace and reorder components of the source document within context. Sets of transduction rules may be cascaded in order to provide for a number of conversion phases within one run of Transform.
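As an illustration, consider a hypothetical transduction rule (invented here, not part of the running example) that replaces every occurrence of 'colour' by 'color':

```
'color' <:: 'colour' .   /* rhs: pattern located in the source; lhs: its replacement */
```

Because all transduction rules run in parallel, such a rule would be applied wherever its pattern matches in the source document.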
Transduction rules may contain non-terminals which are, in their turn, defined by recognition rules.
Transduction grammars can be made very powerful. To get an impression: see the examples in the directories examples_more\cijfers and examples_more\chiffres on the translation of numbers into Dutch and French alphabetic representations.
It is advised to use recognition grammars in the first stage of a conversion process. With a recognition grammar you may specify the syntax of the source document on the character level. All inconsistencies and errors will be reported. In that case either the grammar or the source document will have to be adjusted.
When recognition succeeds one may proceed in two ways. The recognition grammar may be supplied with actions and parameters in order to produce the output document. Alternatively, one may reorganize the recognition grammar into a transduction grammar. In the case of up-conversion to XML the first strategy suffices most of the time.
In case the structure of the source document is quite different from the structure of the target XML document one may first specify a document schema according to the structure of the source document and convert according to that schema. In a second stage XSLT may be used for the conversion to the target document.
In the case of up-conversion of “floating elements” into an XML mixed content model, the use of recognition and transduction grammars may be combined. Floating elements escape, by definition, a rigorous syntactic description of their position in a document. A limited transduction grammar may then be used to identify floating elements in the source document and to supply them with XML tags (and attributes). This can be done in a pre- or post-processing stage adjacent to the recognition stage. This will be demonstrated in example 5.
Not all extensions to the syntax can be used together; the allowed combinations are listed. Read more on the allowed combinations.
The meta symbols used in Transform grammars for EBNF notation are listed below, together with their names and a short description of their functions:
Delimiters and operators used in rules are:
name                 meta symbol   function
rewrite sign         ::=           recognition: written between left- and right-hand side
transduction sign    <::           transduction: written between left- and right-hand side
concatenation sign   ,             concatenation of symbols (optional)
alternative sign     |             between alternatives
end of rule sign     .             end of a rule
comment signs        /* and */     start and end of a comment
action brackets      { and }       actions are notated between these brackets
Regular expressions are used for the description of strings of characters as well as for the grouping of symbols, like in an XML content model. Reserved symbols for regular expressions are:
name                       meta symbol   function
regular expression open    (             opening bracket of a regular expression
negated reg. expr. open    `(            opening bracket of a negated regular expression
r.e. close, zero or one    )?            zero or one occurrence of the enclosed part
r.e. close, one            )             exactly one occurrence of the enclosed part
r.e. close, zero or more   )*            zero or more occurrences of the enclosed part
r.e. close, one or more    )+            one or more occurrences of the enclosed part
The comma in between symbols is optional. For readability, whitespace may be added freely. Empty rules are allowed: in that case the rhs is absent. All types of recursion are allowed (left-, middle- and right-recursion).
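To illustrate these points with hypothetical rules (invented here, not part of the running example), an empty rule has no rhs at all, and a recursive rule may refer to itself on either side:

```
Empty  ::= .                         /* empty rule: the rhs is absent */
Digits ::= Digits, Digit | Digit .   /* left recursion is allowed */
Digit  ::= [0-9] .
```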
For examples, see grammar Ex_1.grm below.
All characters (including the characters that have a meta function) will be recognized by the system as terminals if they are surrounded by single quotes, for example:
Vowel ::= 'a' | 'e' | 'i' | 'o' | 'u' .
s ::= ' ' | #x09 | CR.   /* space, TAB or CR */
Quote ::= '''' | '"'.    /* The single quote itself must be doubled. */
The reserved symbol "cr" or "CR" matches an end of line character.
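For example, a hypothetical rule (invented here) describing a single line of input could use CR to match the end of the line:

```
Line     ::= ( LineChar )*, CR .   /* a line is any run of non-CR characters, ended by CR */
LineChar ::= [^#x0D] .             /* any character except CR */
```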
All terminal symbols may also be represented by the number of their underlying character code, written in the #xN notation, as shown in the following table.
In summary, there are alternative ways to represent a single character in a grammar:
character                      quoted notation   numeric notation
"name characters" like "a"     'a'               #x61
the "single quote" character   ''''              #x27
the "end of line" character    CR or cr          #x0D
other characters like "."      '.'               #x2E
The matching of characters is by default case sensitive. This may be suppressed by a compiler switch.
Sequences of terminal symbols may be given as a string between single quotes. For example:

abcString ::= 'abc' .

is another notation for:

abcString ::= 'a', 'b', 'c' .

or for:

abcString ::= 'a' 'b' 'c' .
A range of characters is specified as in:

Consonant ::= [b-d] | [f-h] | [j-n] | [p-t] | [v-x] | 'z' .
Brackets ::= [#x28-#x29] .   /* "(" and ")" */

The negated variants are also supported: [^a], [^a-z], [^#xN] and [^#xN-#xN].
Note: the syntax [0-9a-z] is currently not allowed. It has to be written as [0-9] | [a-z].
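For instance, a hypothetical rule (invented here) matching a single alphanumeric character must combine the two ranges as alternatives:

```
AlNum ::= [0-9] | [a-z] .   /* not: [0-9a-z] */
```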
Wildcards are meta symbols which match arbitrary characters (including carriage return, form feed etc.). There are three different wildcards, with different functions: a dontcare, a line and an arb symbol. They are listed below, together with their functions.
symbol   name       function
*        dontcare   matches exactly one arbitrary character
=        arb        matches zero or more arbitrary characters, not including the character following the arb symbol in the grammar
-        line       matches any sequence of zero or more arbitrary characters, including occurrences of the character following the line symbol in the grammar
For example:

Password ::= *, *, *, *, *, * .

defines a password as a sequence of six arbitrary characters. And in:

Comment ::= '/*', =, '*/' .

a sequence of characters which starts with ‘/*’ and ends with ‘*/’ is recognized as a comment. The characters which are matched by the arb symbol will not contain a ‘*’. An alternative definition of comment:

Comment ::= '/*', -, '*/' .

defines the same sequence of characters, but the characters which are matched by the line symbol may contain one or more occurrences of ‘*/’. For example:

/* a comment is marked by a "*/" at the end */

will be matched twice by the rule with a line symbol. It will only be matched once by the rule with an arb symbol.
Non-terminal symbols are symbols that occur at least once on the lhs of a context free rule. Their names consist of a sequence of one or more alphabetic characters (uppercase or lowercase, case sensitive), the digits 0 to 9 and the underscore character ("_"). The maximum length of a non-terminal name is 32 characters.

The components of regular expressions may be terminals as well as nonterminals. Nesting of regular expressions may be of any depth. Regular expressions cannot be used on the left-hand side of grammar rules.
In the sequel we will gradually build up a complete example of the conversion of an aerospace manual.
The examples are in directory “Examples_User_Manual” and are called Ex_1, Ex_2 and so on. The corresponding filenames have self-evident extensions. The complete example is in exercise 4. Subsequent examples contain variations.
For each example we give, if applicable, an explanation, the source document, the XML DTD, the Transform grammar and the target document in XML.
Within the examples the following elements and attributes are used:
Element   Element description           Attributes   Attribute description
LI        list item
MAT       material (floating element)   MREF         reference to the material
NOTE      note (floating element)
RT        refer to task                 RID, PAGE    reference to a task and to a page number
STID      subtask identification        ID           identification of a subtask
STT       subtask title                 LTR          sequence indicator (A, B etc.)
SUBTASK   subtask
TASK      task
TID       task identification           ID           identification of a task
TT        task title
WAR       warning (floating element)
The first example serves as a kind of “Hello world” example. A tiny fragment of the source document will be used to give a quick hands-on experience. The target document only contains start and end tags.
This example shows the first two lines of our example document. They are in file Ex_1.txt in directory Ex_1.

TASK
Remove Hatch Seals

The DTD of the target document is contained in file Ex_1.dtd:

<?xml version="1.0" encoding="UTF-8"?>
<!ELEMENT TASK (TT)>
<!ELEMENT TT (#PCDATA)>

The grammar which describes Ex_1.txt is in file Ex_1.grm. For training purposes it is of greater detail than necessary to identify the #PCDATA. The grammar also specifies the characters, words and whitespace within the #PCDATA.
Starting with Ex_1.dtd you may arrive at Ex_1.grm by following the next steps (which will be augmented in chapter 15).
- rewrite the DTD in Transform grammar notation:

TASK ::= TT .
TT ::= PCDATA.

- add terminal symbols which are associated with the start and the end of the elements:

TASK ::= 'TASK', TT.

- refine PCDATA:

PCDATA ::= token, ( whitespace, token )*.

- refine token:

token ::= ( char )+.

- add allowable whitespace, like in:

TASK ::= 'TASK', whitespace, TT.

- add regular expressions for char and whitespace:

whitespace ::= ( s )+.
char ::= [#x00-#x08] | [#x0A-#x0C] | [#x0E-#x1F] | [#x21-#x7F].
/* all characters, except space, TAB and CR */
s ::= ' ' | #x09 | CR.   /* space, TAB or CR */
Format the completed grammar in an easy-to-read style, for instance like:

TASK       ::= 'TASK'
               whitespace
               TT.
TT         ::= PCDATA.

/* symbols in document: */
PCDATA     ::= token
               ( whitespace
                 token
               )*.
token      ::= ( char )+.
whitespace ::= ( s )+.
char       ::= `(' ' | #x09 | #x0D).   /* all characters, except space, TAB and CR */
s          ::= ' ' | #x09 | CR.        /* space, TAB or CR */
We will now run the compiler and the parser.

- In the GUI press “Select grammar…” and select the file Ex_1.grm. The selected filename will show up in the status bar on top of the GUI. Transform will memorize the filename in between sessions until another filename is selected.

- Press “Compile grammar…”. A command box pops up and a number of messages appear about the deletion of intermediary files. Then the current values of the switches for the compiler are shown. If everything goes well you get the message “# Command completed, exit status =”. If there are errors in the grammar they will be shown in the command box as well as in the file Ex_1.err.

- Press “Select input file…” and select the source document Ex_1.txt. The selected filename will also show up in the status bar on top of the GUI. Close the command box.

- Press “Run grammar”. Again a command box shows up. If everything goes well you get the messages “Input correct.” and “exit status =”.

- You are finished. Because the grammar contains no output instructions, no target document will be created.
You may now start your first experiments. Change the grammar and the source document in a controlled way and see what is happening. If there are errors in the grammar or in the source document Transform will try to recover and to report as many errors as possible.
Actions relate to extensions of the formalism. The main types of actions are:
(1) messages
(2) assignments and tests
The general rules for Actions are:
- Actions are written between braces "{" and "}".
‑ Apart from assignments and tests, an action starts with a keyword determining the type of the action. These keywords are "r:" and "s:" and may be written in upper or lowercase without difference in meaning.
- Actions may be conjoined by the conjunction operator (semicolon ";" or ampersand "&").
- Actions may only be used on the right-hand side of a context free rule. They may occur:
o after each symbol or regular expression
o between the ")" and the "*" or "+" of a regular expression
o as the first (or only) element of an alternative.
For some types of actions the effects depend on their position in the rule.
The target XML document may be created by the use of so-called “messages”. Messages mark the relevant points in the recognition process by which other (external) processes can be triggered. Messages provide for event-based processing in general purpose programming languages, like the SAX APIs.
An extensive description of messages is given here. For the creation of a target XML document the following description suffices.
When messages are used in a grammar, an output file with the extension .REP (also called "report file") is created in which the messages are written. (If the numbers of the reports are suppressed then the extension will be .XML.) There are two variants: simple messages and messages with variables.
A message is always written within an action as:
{ r: <number> }

Reports or r-messages report the number followed by a colon plus the most recently matched terminal symbol. In the case of subsequent r-messages with the same report number, the report number is given once and the reported symbols are concatenated (even if they were not contiguous in the input). This mechanism unifies the creation of either a target XML document or a file containing markers for event-based processing.

R-messages always output the character that has most recently been read. Thus, if a report follows a symbol or regular expression that covers no input (as may be the case in the "( )" and "( )*" expressions), the character that was read before the regular expression will be output.
In the grammar:

tokens    ::= token, ( separator, token )*.
token     ::= ( char {R:1} )+.
char      ::= [a-z] .
separator ::= ' ' | #x09.   /* space or TAB */
any sequence of lowercase alphabetic characters will be recognized as a token and will return as output a report file with one line. It consists of the report number "1" followed by a colon and all recognized characters, since subsequently reported terminal symbols with the same number are concatenated.
For example, for the string "abc def " the output will be:
1:abcdef
The separator “space” will not be reported.
With the runtime switch CASE_CONVERT, with values LOWER_TO_UPPER and UPPER_TO_LOWER, one may convert the case of the input.
A variant of the simple r-messages can be used to communicate the value of expressions with variables. In this use, the report or signal number is again followed by a colon, now followed by an expression:

{ r: <number> : <expression> }
The use of variables and expressions will be described in the sequel.
The effect of this action is that as soon as the input is recognized up to the point of the action, the current value of the expression will be written to the report file, preceded by the number and a colon. The same rule for concatenation is followed as for simple messages.
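As a hypothetical illustration (the rule and variable names are invented here), a rule could collect the matched characters in a local variable and report the collected value in a single message once the token is complete:

```
token ::= ( char {var=var||%} )+ {R:2:var} .
```

Here "%" is the reserved variable holding the last read character, and "var" is an ordinary local variable that starts out as the empty string.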
Expressions may contain literal string values. Therefore, the action

{R:1:'<TASK>'}

will write the tag <TASK>.
The next example (2) is an extension of example 1. Messages are added and a target XML document is created. A string of whitespace characters is replaced by a single blank.
TASK
Remove Hatch Seals

<?xml version="1.0" encoding="UTF-8"?>
<!ELEMENT TASK (TT)>
<!ELEMENT TT (#PCDATA)>

TASK       ::= 'TASK'
               {R:1:'<TASK>'}
               whitespace
               {R:1:'<TT>'}
               TT
               {R:1:'</TT></TASK>'}.
TT         ::= PCDATA.

/* symbols in document: */
PCDATA     ::= token
               ( whitespace
                 {R:1:' '}
                 token
               )*.
token      ::= ( char {R:1} )+.
whitespace ::= ( s )+.
char       ::= `(' ' | #x09 | #x0D).   /* all characters, except space, TAB and CR */
s          ::= ' ' | #x09 | CR.        /* space, TAB or CR */

<TASK>
<TT>Remove Hatch Seals</TT>
</TASK>
In Transform grammars, two kinds of variables can be used: local variables and parameters. The main difference between the two types lies in their scope; the scope of a local variable is the rule in which it occurs, while a parameter can be used to transmit information between different rules. All operations that can be performed on local variables are applicable to variables introduced as parameters as well.

Every cf rule, except context sensitive and transduction rules, may be extended with local variables. Information (in the form of a string) available at some point in the rule can be stored in a variable so that it can be used at another point in the same rule. A variable has a name and a value (which is always a string).

A local variable need not be declared; the variable will be allocated at the moment of its first reference in the rule. At the moment of its allocation it contains the empty string, and this remains its content up to the moment that another value is assigned to it.

There is a reserved variable name (percent "%"). Its value is always the character that has most recently been read.

Besides variables, literal string values (also called literals) can be used in actions. Literals are strings of characters between single quotes. The empty string can be represented by merely two quotes. To represent a single quote as part of a literal, it should be written twice.

Expressions can be built from variables and literals. This is done with the concatenation operator "||" or "+" (equivalent). The value of an expression is the concatenated value of all its components. (The term "expression" also refers to its most simple form: a simple variable or a simple literal.)
As an example, we give the value of some expressions that can be composed from the following components:
var1 with the value "aa"
var2 with the value "bb"
% with the value 'c'
the literal 'dd'
expression          resulting value
var1 || var2        "aabb"
var1 || %           "aac"
'dd' || var2        "ddbb"
var1 || % || var1   "aacaa"
A variable may be assigned a value by means of the assignment operator ("="); in general:
{ <variable> = <expression> }.
The value of a variable can be set in different ways, for example:
{var1=var2} or {var1=%} /* assignment of a variable */
{var1='abc'} /* assignment of a literal */
{var1=var2||'abc'} /* assignment of an expression */
The value of a variable may be empty at any time. Remember that a variable has an empty value at the moment of its first appearance in a rule; its value can become the empty string again by assigning the literal representing the empty string to it. Therefore:
{var=''}
removes the value of the variable "var".
Explicit assignment to the reserved "%" is not possible; this variable is assigned its value, the last read character, at run time.
A powerful mechanism is the application of tests: the recognition of input can be made dependent on a testable criterion. This feature can be used successfully in grammars that describe ambiguous phenomena and can limit the number of alternative parse trees.

Tests are performed on variables and literals, which can be combined into expressions.

Tests concern the truth value of a comparison between two expressions. If this value is false, the current recognition path stops. There are two test operators, one for equality and one for inequality.
The equality test:

{ <expression1> == <expression2> }

returns the value true if "expression1" has the same value as "expression2", while the inequality test:

{ <expression1> != <expression2> }

returns the value true if "expression1" refers to a value other than that specified by "expression2".
Disjunct tests may be stated by means of the "or" operator ("|"):

{ <test1> | <test2> }

is true when the evaluation of at least one of the tests leads to the value true. For example:

{ var1=='' | var2==% }

is true if the value of the variable "var1" is empty or if "var2" has the value of the last read character in the input.
Multiple assignments may be conjoined by the conjunction operator (";" or "&"):

{var1=expr1; var2=expr2} or {var1=expr1 & var2=expr2}

Multiple tests may also be conjoined by the conjunction operator:

{var1==expr1 & var2!=expr2} or {var1==expr1; var2!=expr2}

and the combination is true when all of the tests are true.

The two operators behave identically. It is customary to separate assignments by ";" and to separate tests by "&".
A non-terminal symbol may be supplied with parameters, which can be viewed as attributes of the symbol (as in attribute grammars). They are declared together in a so-called formal parameter list of a non-terminal symbol on the left-hand side of a context free rule. There are three types of parameters:

i (input) : taking (inheriting)
o (output) : giving (synthesizing)
io (input/output) : taking and giving

The keywords "i", "o" and "io" are case insensitive.

Once declared, the type of the parameters is fixed. The type of a parameter must be specified in the formal parameter list, followed by a colon, followed by one or more names of parameters, separated by commas. A parameter list can contain parameters of one, two or three types. There is no prescribed order if more than one type is used. Each type indicator is separated from the preceding list of names by a comma.
Examples of formal parameter lists are:
Verb(I:MainVerb) ::= ...

Symbol "Verb" has one parameter of which the value is determined in the calling rule; the value of "MainVerb" is not affected after application of the present rule.
VsMod(O:type) ::= ...
Symbol "VsMod" has one parameter of which the value is initially an empty string; the value of "type" is determined during the application of the present rule.
Noun(IO:cat) ::= ...
Symbol "Noun" has one parameter of which the value is initially determined by the calling rule; the value of "cat" may be changed after application of the present rule.
RelClause(I:Antecedent,O:type,IO:cat) ::= ...
Symbol "RelClause" has three parameters, one of each type.
NP(O:cat,number,pers) ::= ...
Symbol "NP" has three parameters, all of the same type.
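The three directions map naturally onto how ordinary functions pass values. The following sketch (invented function names, not Transform code) treats an I parameter as an argument whose changes stay local, an O parameter as a return value, and an IO parameter as both:

```python
def verb(main_verb):
    # Verb(I:MainVerb): the value comes from the calling rule;
    # changes made here do not reach the caller
    main_verb = main_verb.upper()
    return None

def vsmod():
    # VsMod(O:type): the value starts empty and is produced by this rule
    type_ = 'modal'
    return type_

def noun(cat):
    # Noun(IO:cat): the value comes in from the caller and the
    # (possibly changed) value is handed back
    if cat == '':
        cat = 'common'
    return cat
```

The analogy is loose (Transform parameters live on grammar symbols, not functions), but it captures which side determines the value.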
When the non-terminal is used on the right-hand side of one or more rules, the parameters must be repeated in a so-called actual parameter list. The names of the actual parameters can of course be different from the names used in the formal parameter list. In the actual parameter list it is allowed, but not necessary, to repeat the type. If the type is indicated on the right-hand side, it must correspond to the type within the declaration. So, one can choose between, for instance:
... NP(cat, number, pers) ... and ... NP(O:cat, number, pers) ...
The use of parameters enables the exchange of information between rules. For example, number agreement of subject and verb can be handled with the use of parameters in combination with a test action:
Sentence ::= NP(number1),
VP(number2) {number1==number2} .
NP(O:number) ::= ... .
VP(O:number) ::= ... .
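The agreement test above can be mimicked with two toy functions that each synthesize a "number" value which the sentence level then compares; the string heuristics are invented for this sketch and are far too crude for real language:

```python
def np_number(words):
    # NP(O:number): synthesize 's'/'p' from the head noun (toy heuristic)
    return 'p' if words[-1].endswith('s') else 's'

def vp_number(words):
    # VP(O:number): synthesize 's'/'p' from the finite verb (toy heuristic)
    return 's' if words[0].endswith('s') else 'p'

def sentence(np, vp):
    # Sentence ::= NP(number1), VP(number2) {number1==number2}
    return np_number(np) == vp_number(vp)
```

Here `sentence(['the', 'man'], ['sees', 'a', 'bird'])` succeeds because both sides synthesize singular, while a plural subject with the same verb phrase fails the test.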
The starting symbol can also have output parameters (but no input or input/output ones). The content of these top-level parameters can be written to the report file and/or the screen with the run time switch OUTPUT_PARAMETERS.
The same operations that are allowed on variables like assignments and tests are allowed on parameters. In a rule with a formal parameter list, local variables can be introduced as well.
Recommendation: The compiler distinguishes parameters from local variables by examining the formal parameter list. Thus, when a parameter name is misspelled (due to a typing error), it is silently considered to be a local variable. Therefore, it is advisable to set the compiler switch CHECK_VARIABLES to TRUE especially in early stages of the development of a grammar. Then, for every rule in the grammar, the local variables will be written to the compiler error file.
Restriction: Parameters can only be used if FINITE=FALSE. The reason is that, if FINITE=TRUE, the compiler will try to create a FSM, which has no memory for variables.
The next example demonstrates the use of variables.
TASK t52-43-01-000-800
<!ELEMENT TASK (TID)>
<!ELEMENT TID EMPTY>
<!ATTLIST TID
ID CDATA #REQUIRED
>
TASK ::=
    'TASK'
    whitespace
    't'
    TID(I:ID)
    {R:1:'<TID ID="' || ID || '"/>'}.
TID(O:ID) ::=
    number(I:N) {ID = ID||N}
    ( '-' {ID = ID||%}          /* '%' denotes the last read character, which is '-'.
                                   Its value is concatenated to ID */
      number(I:N) {ID = ID||N}  /* 'number' delivers the value of 'N',
                                   which is concatenated to 'ID' */
    )*.
number(O:N) ::=                 /* the value of N is constructed by
                                   concatenating all individual digits */
    ( d {N = N||% })+.
d ::=
    [#x30-#x39].
whitespace ::=
    ( #x09 | #x20 | CR )+ .
<TID ID="52-43-01-000-800"/>
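The effect of these rules on the example line can be approximated with a regular expression; `emit_tid` is a hypothetical helper written for this sketch, not part of Transform:

```python
import re

def emit_tid(line):
    # 'TASK', whitespace, 't', then digit groups joined by hyphens,
    # as in the TID/number rules above; returns the emitted TID element
    m = re.fullmatch(r"TASK\s+t(\d+(?:-\d+)*)", line.strip())
    return f'<TID ID="{m.group(1)}"/>' if m else None

print(emit_tid('TASK t52-43-01-000-800'))  # -> <TID ID="52-43-01-000-800"/>
```

Note how the grammar builds ID incrementally by concatenation, whereas the regex captures it in one group; the result is the same for this input.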
After the introduction of the main parts of the formalism of Transform it is now time to show the complete example on the up-conversion of the aerospace manual. We follow a general working procedure, which is a refinement of the working procedure already outlined in the position paper on up-conversion to XML.
- Obtain a number of source documents together with their target documents in XML.
- Obtain the document schema, preferably in two notations:
  o an XML Schema notation with restrictions on types
  o the DTD notation, in order to reuse the content models (many XML tools can generate a DTD from an XML Schema).
- Convert the DTD, in a one-to-one correspondence, into a grammar in the Transform formalism.
  o Rewrite an element definition into a grammar rule:
    § the element name becomes the lhs of the rule
    § the content model becomes the rhs of the rule
    e.g.
    <!ELEMENT SUBTASK (STID, STT, NOTE?, WAR?, LI+)>
    becomes
    SUBTASK ::= STID, STT, (NOTE)?, (WAR)?, (LI)+.
    (Note that single symbols which are optional or repeatable have to be surrounded by parentheses.)
  o Rewrite the attributes in an attribute definition as parameters to the nonterminal which replaces the element name.
    e.g.
    <!ELEMENT STT (#PCDATA)>
    <!ATTLIST STT LTR CDATA #REQUIRED>
    becomes
    STT(O:LTR) ::= PCDATA {…. LTR = …}.
- Observe the documents and determine in which manner they contain material that identifies the start and end of elements and the appearance of attributes and entities. In general there are three types of information:
  o A. strings which mark the structure of elements
  o B. strings which are part of #PCDATA
  o C. strings which are purely used for presentation, like whitespace
  Human intervention in the conversion process will be necessary if
  o some XML markup cannot be identified in the documents
  o the same identifying material is used for different markup, and there is insufficient context for disambiguation.
- Add regular expressions to the grammar which describe all variants of the strings of types A, B and C. Be precise in the specification of strings of type C; e.g. the following rules are ambiguous:
  P ::= Q, R.
  Q ::= X, whitespace.
  R ::= whitespace, Y.
  whitespace ::= ( space | tab | cr )+.
  This ambiguity can easily be avoided by following the rule that either:
  o no grammar rule starts with strings of type C, or
  o no grammar rule ends with strings of type C.
- If an XML Schema is available, add regular expressions to the grammar which account for the restrictions on types in the schema.
- Compile the grammar into a parser. Parse the documents and let errors be reported. Errors may be caused by flaws in the grammar and/or by errors and inconsistencies in the documents. Edit the grammar and/or the documents. The parse-adjust cycle may be repeated until no errors are reported.
- Add output specifications to the grammar. The documents will be converted into XML.
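The element-to-rule step is mechanical enough to sketch in a few lines. The converter below only handles flat sequence content models (names separated by commas, each optionally followed by ?, + or *), which covers the SUBTASK example above; richer content models would need a real DTD parser, and the function name is invented:

```python
import re

def dtd_to_rule(decl):
    # <!ELEMENT name (model)>  ->  "name ::= model."  with single optional
    # or repeatable symbols wrapped in parentheses, as Transform requires
    m = re.match(r'<!ELEMENT\s+(\w+)\s+\((.*)\)\s*>', decl)
    name, model = m.group(1), m.group(2)
    parts = []
    for tok in (t.strip() for t in model.split(',')):
        occ = re.match(r'(\w+)([?+*])$', tok)
        parts.append(f'({occ.group(1)}){occ.group(2)}' if occ else tok)
    return f'{name} ::= ' + ', '.join(parts) + '.'

print(dtd_to_rule('<!ELEMENT SUBTASK (STID, STT, NOTE?, WAR?, LI+)>'))
# -> SUBTASK ::= STID, STT, (NOTE)?, (WAR)?, (LI)+.
```

This reproduces exactly the rewriting shown in the SUBTASK example.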
This example shows the conversion of the complete aerospace manual. Note the correspondence between the DTD and the grammar.
TASK t52-43-01-000-800
Remove Hatch Seals
SUBTASK
s52-43-01-010-001-001
A. Get
access
(1) Open the hatches and remove them
(refer to TASK t52-43-00-000-800 / page book
401).
(2) Put the hatches down on a soft mat.
SUBTASK s52-43-01-020-001-001
B. Remove
the hatch seals.
NOTE
On each hatch there are 42 screws.
(1) Remove the screws.
(2) Remove
the retainer strips.
SUBTASK
s52-43-01-350-001
C. Remove
corrosion
(1) Remove any corrosion with abrasive
cloth (material No. Fk05-024).
SUBTASK s52-43-01-110-001
D. Clean
the hatch
WARNING
BE CAREFUL WHEN YOU USE THE
METHYLETHYLKETONE
(MEK). IF YOU GET MEK IN YOUR EYES:
- FLUSH YOUR EYES WITH WATER
- GET MEDICAL HELP.
THE
MEK IS POISONOUS.
(1) Clean
the areas with MEK (material No. Fk11-003).
For a quick reference we repeat the table with element descriptions.
Element  | Element description         | Attributes | Attribute description
---------|-----------------------------|------------|------------------------------------------
LI       | list item                   |            |
MAT      | material (floating element) | MREF       | reference to the material
NOTE     | note (floating element)     |            |
RT       | refer to task               | RID, PAGE  | reference to a task and to a page number
STID     | subtask identification      | ID         | identification of a subtask
STT      | subtask title               | LTR        | sequence indicator (A, B etc.)
SUBTASK  | subtask                     |            |
TASK     | task                        |            |
TID      | task identification         | ID         | identification of a task
TT       | task title                  |            |
WAR      | warning (floating element)  |            |
<?xml version="1.0" encoding="UTF-8"?>
<!ELEMENT TASK (TID, TT, SUBTASK+)>
<!ELEMENT TID EMPTY>
<!ATTLIST TID
ID CDATA #REQUIRED
>
<!ELEMENT TT (#PCDATA)>
<!ELEMENT SUBTASK (STID, STT, NOTE?, WAR?, LI+)>
<!ELEMENT STID EMPTY>
<!ATTLIST STID
ID CDATA #REQUIRED
>
<!ELEMENT STT (#PCDATA)>
<!ATTLIST STT
LTR CDATA #REQUIRED
>
<!ELEMENT NOTE (#PCDATA)>
<!ELEMENT WAR (#PCDATA | LI)*>
<!ELEMENT LI (#PCDATA | RT | MAT)*>
<!ELEMENT MAT EMPTY>
<!ATTLIST MAT
MREF CDATA #REQUIRED
>
<!ELEMENT RT EMPTY>
<!ATTLIST RT
RID CDATA #REQUIRED
PAGE CDATA #REQUIRED
>
<!DOCTYPE TASK SYSTEM "F.dtd">
<TASK>
<TID
ID="t52-43-01-000-800"/>
<TT>Remove Hatch Seals</TT>
<SUBTASK>
<STID ID="s52-43-01-010-001"/>
<STT LTR="A">Get access</STT>
<LI>Open the hatches and remove them <RT RID="t52-43-00-000-800" PAGE="401"/>.</LI>
<LI>Put the hatches down on a soft mat.</LI>
</SUBTASK>
<SUBTASK>
<STID ID="s52-43-01-020-001"/>
<STT LTR="B">Remove the hatch seals.</STT>
<NOTE>On each hatch there are 42 screws.</NOTE>
<LI>Remove the screws.</LI>
<LI>Remove the retainer strips.</LI>
</SUBTASK>
<SUBTASK>
<STID ID="s52-43-01-350-001"/>
<STT LTR="C">Remove corrosion</STT>
<LI>Remove any corrosion with abrasive cloth<MAT MREF="Fk05-024"/>.</LI>
</SUBTASK>
<SUBTASK>
<STID ID="s52-43-01-110-001"/>
<STT LTR="D">Clean the hatch</STT>
<WAR>BE CAREFUL WHEN YOU USE THE METHYLETHYLKETONE (MEK). IF YOU
GET MEK IN YOUR EYES:
<LI>FLUSH YOUR EYES WITH WATER</LI>
<LI>GET MEDICAL HELP.</LI>
THE MEK IS POISONOUS.
</WAR>
<LI>Clean the areas with MEK<MAT
MREF="Fk11-003"/>.</LI>
</SUBTASK>
</TASK>
TASK ::=
'TASK'
{R:1:'<TASK>'}
whitespace
TID(I:ID)
{R:1:'<TID ID="' || ID ||
'"/>'}
whitespace
{R:1:'<TT>'}
TT
{R:1:'</TT>'}
( whitespace
{R:1:'<SUBTASK>'}
SUBTASK
{R:1:'</SUBTASK>'}
)+
( whitespace )?
{R:1:'</TASK>'} .
TID(O:ID)
::=
't' {ID = %}
Numbers_with_hyphens(I:ID1)
{ID = ID || ID1}.
TT
::=
PCDATA.
SUBTASK
::=
'SUBTASK'
separators
STID(I:ID)
{R:1:'<STID ID="' || ID ||
'"/>'}
whitespace
STT(I:LTR)
( whitespace
{R:1:'<NOTE>'}
NOTE
{R:1:'</NOTE>'}
)?
( whitespace
{R:1:'<WAR>'}
WAR
{R:1:'</WAR>'}
)?
( whitespace
STLI
)+.
STID(O:ID)
::=
's' {ID = %}
Numbers_with_hyphens(I:ID1)
{ID = ID || ID1}.
NOTE ::=
'NOTE'
( separators
no_hyphen_parenleft {R:1}
PCDATA_till_CR
)+.
WAR ::=
'WARNING'
( separators
no_hyphen_parenleft {R:1}
PCDATA_till_CR
|
WARLI
)+.
WARLI ::=
( separators )?
'-'
separators
{R:1:'<LI>'}
PCDATA_till_CR
{R:1:'</LI>'}.
STLI ::=
'('
number(I:N)
')'
separators
{R:1:'<LI>'}
MIXED_CONTENT
{R:1:'</LI>'}.
STT(O:LTR)
::=
capital {LTR=%}
('.')?
separators
{R:1:'<STT LTR="' || LTR ||
'">'}
PCDATA_till_CR
{R:1:'</STT>'}.
MIXED_CONTENT
::=
(
(
no_parenleft {R:1} )+
|
MAT
|
RT
)+.
MAT ::=
'('
('M' | 'm')
'aterial No. Fk'
Numbers_with_hyphens(I:mref)
')'
{R:1:'<MAT MREF="Fk' || mref ||
'"/>'} .
RT ::=
'(refer to TASK '
't'
Numbers_with_hyphens(I:rid)
' / page book '
number(I:page)
')'
{R:1:'<RT RID="t' || rid ||'" PAGE="' || page || '"/>'} .
d ::=
[#x30-#x39]. /* name "d" as in XML Schema */
Numbers_with_hyphens(O:ID)
::=
number(I:N) {ID = ID || N}
( '-' {ID = ID || %}
number(I:N) {ID = ID || N}
)*.
number(O:N)
::=
( d {N = N || % })+.
PCDATA
::=
tokens
(
(
separators )?
CReturn
(
separators )?
tokens
)*.
PCDATA_till_CR
::=
tokens
( separators )?
CReturn.
tokens
::=
token
( separators {R:1:' '}
token
)*.
separators
::=
( separator )+ .
s
::=
separator
| CReturn.
/* name "s" as in XML Schema */
whitespace
::=
( s )+.
token
::=
( char {R:1})+.
separator
::=
' '
| #x09. /* space or TAB */
capital
::=
[#x41-#x5A].
char
::=
`(' ' | #x09 | #x0D). /* no space, TAB or
CR */
no_parenleft
::=
`('(' | #x0D). /* no '(' or CR */
no_hyphen_parenleft
::=
`(' ' | #x09 | '-' | '('). /* no space,
TAB, '-' or '(' */
CReturn
::= CR {R:1} .
A context sensitive rule (technically called a Chomsky type-1 or type-0 rule) has two or more symbols at the left-hand side. By preference, the rewrite sign between the two sides is "<::". The left-hand side of such a rule may never be empty. In context sensitive rules ordinary terminals, like 'a' or #x0E or '.', can be used at both sides. The two sides can also contain non-terminals and intermediates; their use will be explained later.
Context-sensitive grammars are mainly used for linguistic purposes. Therefore, we concentrate on these purposes. (More reading). A variant of context-sensitive grammars are transduction grammars, which we will discuss in the next chapter.
In the next example, the input "in the garden the man sees a bird" as well as "the man sees a bird in the garden" will be recognized as the starting symbol "Sentence".
Sentence ::= Prep, Subj, Verb, Obj .            /* a context-free rule */
Prep ::= 'in the garden ' .
Subj ::= 'the man ' .
Verb ::= 'sees ' .
Obj ::= 'a bird ' .
Prep, Subj, Verb, Obj <:: Subj, Verb, Obj, Prep . /* a context-sensitive rule */
In section 7.1.2 Transduction Grammars were introduced. In section 7.1.3 we discussed the applicability for XML up-conversion, e.g. when floating elements in the source documents have to be converted into mixed content. Example 5 will clarify this. We start with a short description of Transduction Grammars. More information is available in the Reference Manual.
Transduction grammars differ in many aspects from grammars used for recognition and/or parsing, while the syntax is nearly the same. Transduction grammars contain transduction rules, which are essentially context-sensitive rules. However, transduction grammars do not have a first or starting symbol. A transduction rule can be interpreted as "the right-hand side (the input side), when recognized, will be replaced by the left-hand side (the output side)".
The left-hand side can be empty. In that case a pattern in the input matched by the right-hand side will be deleted. The right-hand side must contain at least one symbol (in a recognition grammar the right-hand side may be empty).
In a transduction grammar there is no need to describe the whole input as is the case in a grammar used for recognition and parsing. Only parts of the input may be covered by the rules. Input not described by any transduction rule will be simply emitted to the output.
It is important to be aware of the fact that during run time, as soon as a transduction rule reduces, the input covered by the right-hand side will be replaced by the symbols mentioned in the left-hand side. The run time system resumes scanning the input with the first symbol of the newly inserted ones, thus treating the replaced part as new input.
Examples:
/*1*/ 'a' <:: 'a', 'a' .      /* each "aa" in the input will be replaced by "a" */
/*2*/ 'b', 'a' <:: 'a', 'b' . /* each "ab" will be replaced by "ba" */
/*3*/ <:: 'd', 'd' .          /* each "dd" will be omitted */
A grammar consisting of only rule 1 will transform a large number of "a" 's into only one "a". The first "aa" will be replaced by one "a", which together with the next "a" forms a new pair, which again will be replaced by one "a", until finally the last "a" remains untouched. By the same mechanism, a grammar consisting of only rule 2 rewrites an input "aaabbb" to the output "bbbaaa" and an input "cabc" to the output "cbac". Finally, a grammar consisting of only rule 3 transforms an input "cddcd" into "ccd".
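The scan-and-rescan behaviour described above can be imitated with a small shift-reduce style simulator. This is only an illustration of the semantics on single-character symbols (it also applies the longest-right-hand-side preference discussed later in this chapter); it is not the actual run time system:

```python
from collections import deque

def transduce(rules, text):
    # rules: list of (lhs, rhs) pairs over single-character symbols.
    # Scan left to right; as soon as a right-hand side has just been read,
    # replace it by the left-hand side and resume scanning at the first
    # inserted symbol (longest rhs wins; on a tie, textual order).
    ordered = sorted(rules, key=lambda r: -len(r[1]))   # stable sort keeps order
    stack, pending = [], deque(text)
    while pending:
        stack.append(pending.popleft())                 # read one input symbol
        for lhs, rhs in ordered:
            if len(stack) >= len(rhs) and ''.join(stack[-len(rhs):]) == rhs:
                del stack[-len(rhs):]                   # drop the recognized rhs
                pending.extendleft(reversed(lhs))       # rescan the inserted lhs
                break
    return ''.join(stack)

print(transduce([('a', 'aa')], 'aaaa'))     # rule 1: -> 'a'
print(transduce([('ba', 'ab')], 'aaabbb'))  # rule 2: -> 'bbbaaa'
print(transduce([('', 'dd')], 'cddcd'))     # rule 3: -> 'ccd'
```

Running the three example rules reproduces exactly the rewritings described in the paragraph above, including the stepwise "bubbling" of rule 2.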
For a transduction grammar the compiler switch TRANSDUCE has to be set to TRUE. This has to be done in the “Advanced mode” of the GUI. (The setting will be remembered when you return to the “Basic Mode”.)
There are situations in which you want a non-terminal on the left-hand side to stand for the input it has matched. This may be the case when you want to specify extra context in a rhs by one or more nonterminals, where the nonterminals may be described by context-free rules.
Using the cover sign (the caret, "^"), you can mark a non-terminal on the left-hand side of a context-sensitive rule as being a cover non-terminal. Upon recognition of the rule, the run time system will replace cover non-terminals with the terminal symbols they have covered on the right-hand side. (Note that it is an error to use a cover symbol that occurs twice or more at the right-hand side. On the other hand, it is allowed to use the same cover symbol more than once on the left-hand side.) In run time, cover non-terminals behave as if they were substituted by the sequence of terminals covered by the same symbol on the right-hand side.
Example 1:
'pq', 1_Ending^ <:: 'a', 1_Ending.
'r', 2_Ending^ <:: 'a', 2_Ending .
1_Ending ::= 'bc' | 'de' .
2_Ending ::= 'kl' .
The input ‘ade’ will be rewritten as ‘pqde’, the input ‘akl’ as ‘rkl’.
Example 2:
H1^ <:: '(' H1 ')' .
H1 ::= ([0-9])+.
The transduction of ‘a(325)bc’ will be ‘a325bc’.
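For readers who know regular expressions, a cover non-terminal behaves much like a capture group that is written back into the replacement. The two examples above correspond to the following substitutions (an analogy only; Transform does not use regex replacement internally):

```python
import re

# Example 1: 'pq', 1_Ending^ <:: 'a', 1_Ending.  with 1_Ending ::= 'bc' | 'de'
print(re.sub(r'a(bc|de)', r'pq\1', 'ade'))     # the group keeps what 1_Ending matched

# Example 2: H1^ <:: '(' H1 ')' .  with H1 ::= ([0-9])+
print(re.sub(r'\((\d+)\)', r'\1', 'a(325)bc')) # -> a325bc
```

The first substitution turns 'ade' into 'pqde', matching the rewriting stated in Example 1.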
Intermediate symbols are used in grammars with context-sensitive rules or in a context-free grammar that is the last one of a cascaded grammar (see below). These symbols are called "intermediate" because they are not terminal and not non-terminal, but something in between. They differ from terminal symbols in that they do not occur in the original input: they are introduced in the left-hand side of a rule. They differ from non-terminal symbols in that they do not occur on the left-hand side of a defining (context-free) rule in the grammar.
Intermediate symbols are mainly of interest for linguistic grammars on the grapheme/phoneme level. Further reading.
Special attention must be paid to ambiguous transduction. This occurs if two (or more) context-sensitive rules describe (part of) the same input. For example:
/*1*/
'a', 'c', 'b' <:: 'a', 'b' .
/*2*/
'd', 'b' <:: 'b' .
If (part of) the input is "ab" then as soon as the run time system reads the "b", both rules are recognized. Rule 1 rewrites this input to "acb" and rule 2 to "adb", and the grammar creates ambiguous output. By default the switch ONE_CS_REDUCE will be set to TRUE for transduction grammars. The effect is:
- if two or more context sensitive rules are applicable at the same moment of reduction, the run time system chooses the one with the longest right-hand side (the number of terminals that have been covered by the rule) and
- if two rules have the same length on the right-hand side, the one that is mentioned first in the grammar will be chosen.
If ambiguous output is really wanted, the compiler switch ONE_CS_REDUCE=FALSE can be added to the switches file. In each case, the compiler generates a warning if ambiguous rewritings are to be expected.
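The two tie-breaking criteria amount to a single sort key; `choose_rule` below is an invented helper that merely documents the selection order, it is not part of Transform:

```python
def choose_rule(applicable):
    # ONE_CS_REDUCE=TRUE: prefer the longest right-hand side; on equal
    # length, the rule mentioned first in the grammar wins
    return min(applicable, key=lambda r: (-len(r['rhs']), r['order']))

# The two rules of the example above, both recognized on reading "b":
rule1 = {'order': 0, 'lhs': 'acb', 'rhs': 'ab'}
rule2 = {'order': 1, 'lhs': 'db',  'rhs': 'b'}
print(choose_rule([rule1, rule2])['lhs'])   # -> acb (rule 1 has the longer rhs)
```

So with the default setting the input "ab" is rewritten to "acb" only.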
In designing a transduction grammar, care must be taken not to write transduction rules that will bring the run time system into an endless loop. For example, the following innocent looking rule will cause such a looping problem:
'a', 'a' <:: 'a' .
The input of only one "a" will be replaced by two "a"'s, each of which in turn will be replaced by two "a"'s, etc., until the program crashes due to a shortage of memory.
In transduction grammars the negation of a single character can be extended towards the negation of a string, of a regular expression and of a nonterminal. More reading on negations.
Some document schemas are rather liberal in describing the position of elements, as if they are floating within mixed content. In that case pattern matching becomes a viable option in order to locate the start and end of the floating elements.
This example demonstrates preprocessing by a transduction grammar of the source document of example 4. In the DTD of example 4 the elements Mat and RT were located within mixed content. We assumed that the elements Mat and RT always start with a "(" and that the PCDATA before Mat and RT does not contain "("s. This assumption avoids the problem of ambiguity when "(" may be part of PCDATA and may also initiate the start of Mat and RT. Note and Warning had a rather fixed position. LI elements always started on a new line.
In this example we assume that the PCDATA before Mat and RT may contain "("s and that the positions of Note, Warning and LI are flexible.
The target document of Ex_5 will have the extension “tdo”.
The preprocessing concerns the following:
- to find the start of the elements Note, Warning and LI; their position will be marked in Ex_5.tdo by the character #x80 (printed as a “€”) which does not occur in the source document (any such character may be taken). This marker will do the disambiguation, instead of the “(“. In order to find the start of the elements all relevant context may be specified in the transduction rules.
- to do an immediate markup of the elements Mat and RT.
The next grammar will use the special marker in order to locate the positions for the starttags and the endtags of Note, Warning and LI. The markup of Mat and RT will remain untouched, as being part of a stream of #PCDATA.
This example (and the next one) serves to familiarize oneself with transduction grammars and to get some feeling of what to delegate to a possible preprocessing phase and what to the recognition phase.
In general, preprocessing is only required for the precise handling of floating elements, where ambiguities in handling PCDATA have to be avoided.
'NOT' #x80 'E' <:: 'NOTE'.
'WARNIN' #x80 'G' <:: 'WARNING'.
'(' H1^ #x80 ')' <:: '(' H1 ')' .
'<MAT MREF="', Matno^, '"/>' <::
    '('
    ('M' | 'm')
    'aterial No. '
    Matno
    ')' .
Matno ::=
'Fk'
nums
'-'
nums
.
'<RT RID="'
TID^
'"
PAGE="'
H1^
'"/>'
<::
'(refer
to TASK '
TID
' /
page book '
H1
')'
.
TID ::=
    't'
    nums
    ( '-'
      nums
    )*.
H1 ::=
    nums.
num ::=
    [#x30-#x39].
nums ::=
    ( num )+.
H2^ #x80 ' ' <:: H2 separators.
H2 ::=
    CR
    ( separators )*
    '-'.
separator ::=
    ' '
    | #x09. /* space or TAB */
separators ::=
    ( separator )+ .
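The effect of the keyword and list-item marker rules above can be imitated with plain string operations. In the sketch, MARK stands for the #x80 byte (displayed as '€' in the listings), and `premark` is a simplified stand-in that covers only the NOTE/WARNING and numbered-item rules, not the MAT/RT markup:

```python
import re

MARK = '\u0080'   # the #x80 marker byte, shown as '€' in the manual

def premark(text):
    # 'NOT' #x80 'E' <:: 'NOTE'.  and  'WARNIN' #x80 'G' <:: 'WARNING'.
    text = text.replace('NOTE', 'NOT' + MARK + 'E')
    text = text.replace('WARNING', 'WARNIN' + MARK + 'G')
    # '(' H1^ #x80 ')' <:: '(' H1 ')' : mark numbered list items
    return re.sub(r'\((\d+)\)', lambda m: '(' + m.group(1) + MARK + ')', text)
```

For instance, `premark('(1) Remove the screws.')` yields "(1€) Remove the screws." when the marker byte is rendered as '€'.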
TASK t52-43-01-000-800
Remove Hatch Seals
SUBTASK
s52-43-01-010-001-001
A. Get
access
(1€) Open the hatches and remove them <RT
RID="t52-43-00-000-800" PAGE="401"/>.
(2€) Put the hatches down on a soft mat.
SUBTASK s52-43-01-020-001-001
B. Remove
the hatch seals.
NOT€E
On each hatch there are 42 screws.
(1€) Remove the screws.
(2€) Remove
the retainer strips.
SUBTASK
s52-43-01-350-001
C. Remove
corrosion
(1€) Remove any corrosion with abrasive cloth
<MAT MREF="Fk05-024"/>.
SUBTASK
s52-43-01-110-001
D. Clean
the hatch
WARNIN€G
BE CAREFUL WHEN YOU USE THE
METHYLETHYLKETONE
(MEK). IF YOU GET MEK IN YOUR EYES:
-€
FLUSH YOUR EYES WITH WATER
-€
GET MEDICAL HELP.
THE
MEK IS POISONOUS.
(1€) Clean the areas with MEK <MAT
MREF="Fk11-003"/>.
Use directory Ex_6. Copy grammar Ex_4.grm into Ex_6.grm. Adjust Ex_6.grm in such a way that Ex_5.tdo will be the source document. The target document Ex_5.xml has to be identical to Ex_4.xml.
More solutions are possible. One possible solution is contained in Ex_6_possible_solution.grm.
It is possible to split a grammar into two (or more) grammars, and to link these grammars together into a so-called cascaded grammar. The run time system will treat the input initially with the first grammar and will use the resulting output as input for the second grammar, and so on. The linking can be performed by the program Plink. All grammars have to be transduction grammars, except the last one. The last grammar of a cascaded grammar can be a transducer or a recognizer/parser. In the latter case, the recognizer may perform a check on the rewriting of the former grammars. Of course, such a configuration does not result in a file with the rewritten input (the file with extension .TDO) but in an output as requested by the last grammar: rejecting/accepting the transformed input in the case of a recognizer, a parse tree in the case of a parser, and eventual output written by output instructions.
A cascaded grammar can consist of a maximum of 25 individually compiled grammars.
Remark: The runtime system creates “pipes” between cascaded grammars. That means that as soon as an output symbol becomes available in one cascaded grammar it will be transferred as input to the next cascaded grammar. In that manner the on-line property of the runtime system is maintained.
Suppose each symbol ‘a’ in the input has to be replaced by ‘aa’. As we have seen earlier, the following innocent looking rule will cause a looping problem:
'a', 'a' <:: 'a' .
The input of only one "a" will be replaced by two "a" 's, each of which in turn will be replaced by two "a" 's, etc., until the program crashes due to a shortage of memory.
The problem can be solved in the following way:
Grammar1:
'a', treated, 'a', treated, End <:: 'a', End .
End ::= 'a' | cr .
Grammar2:
<:: treated .   /* this transduction rule has an empty left-hand side */
After linking to a cascaded grammar, the run time system will transform the input without the superfluous intermediate symbol "treated".
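The pipe behaviour and the two-grammar trick can both be imitated with Python generators: each stage consumes the previous stage's output symbol by symbol, so an output symbol of grammar 1 is immediately available to grammar 2. This is toy code, with the character 'T' standing in for the intermediate symbol "treated":

```python
def grammar1(symbols):
    # roughly: 'a', treated, 'a', treated, End <:: 'a', End
    # double each 'a', tagging the copies with the intermediate symbol 'T'
    for sym in symbols:
        if sym == 'a':
            yield from ('a', 'T', 'a', 'T')
        else:
            yield sym

def grammar2(symbols):
    # <:: treated . : the empty left-hand side deletes the intermediate symbol
    for sym in symbols:
        if sym != 'T':
            yield sym

def cascade(stages, text):
    stream = iter(text)
    for stage in stages:
        stream = stage(stream)   # build the pipe between the grammars
    return ''.join(stream)

print(cascade([grammar1, grammar2], 'aba'))   # -> 'aabaa'
```

Because generators are lazy, nothing is buffered between the stages, which mirrors the on-line property mentioned in the remark above.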
Directory Ex_7 contains the compiled transduction grammar Ex_5_td.pg and the compiled recognition grammar Ex_6_possible_solution.pg.
- Be sure that in the Basic Mode of the GUI, at “Report file transformation to XML”, the option “Strip first report number” is selected.
- Press “Advanced mode…”
- Press the tab “Cascade linker”.
- In the box “Select Grammar” select Ex_5_td.pg
- Press “Add to List”
- In the box “Select Grammar” select Ex_6_possible_solution.pg
- Press “Add to List”
- If you want to change the position of a compiled grammar in the sequence, first select it and then use the buttons “Move Up” and/or “Move Down”. If necessary, you may remove a grammar with the button “Remove”.
- In the box “Output to grammar” select the name of an already existing compiled grammar or type a new name, in this case Ex_7.pg.
- Press the button “Link grammars”. The Process Monitor will show up. If the exit status is 0 then everything went well. Else you have to study the error messages.
- Close the Process Monitor.
- The current combination of filenames is automatically saved in the file “Ex_7.sam”. Later you may load this file with the button “Load sam-file…”. All filenames will then be restored.
- Press the tab “Parser”
- Press “Select inputfile…” and select the input file “Ex_4.txt”.
- Press “Run grammar…”. The Process Monitor will show up. If the exit status is 0 then everything went well. Else you have to study the error messages.
- Close the Process Monitor.
- Inspect the output file Ex_4.xml.
- You are done.
Lexicons are mainly used in linguistic grammars (see example 9). However, they can also be used for more mundane tasks like the enumeration of members of a closed set. That is the topic of the following example 8.
Please read first the section in the Reference Manual on lexicons.
When lexicon symbols are used in a grammar, the run time system must use a lexicon. Read here for the construction of a lexicon.
In the DTD of our running example (ex4.dtd) there are two kinds of identifiers: TID for “task identification” and STID for “subtask identification”. They are defined as CDATA, which is very general. Suppose we want to do a rigorous check on the value of the identifiers. We could enumerate them in the grammar ex4.grm, but this would be impractical in the case of a large set of values. Moreover, any time the set changes we would have to recompile the grammar.
For the sake of demonstration we will construct a lexicon with all the allowed values of TID and STID (example 8a) and will adjust the grammar in order to use that lexicon (example 8b).
In LingDoc Transform a lexicon is maintained as a text file. It is compiled into a compiled lexicon that can be used by the runtime system. The compiled lexicon uses the datastructure of a trie on external memory in order to support the on-line property of the runtime system. The lexicon compiler may also add new entries to the compiled lexicon.
This mechanism for maintenance is rather crude. A more sophisticated tool for lexicon maintenance is LexBench, part of LingDoc Clarity. It uses a database for the storage of the lexicon. If you want to use one of the lexicons of Clarity in Transform, you may export the contents of that lexicon as a text file in order to compile it as a lexicon for Transform.
At this moment the syntax of a lexicon in Transform is slightly different from the syntax of an exported lexicon of LexBench. In the near future they will become the same. In the meantime you may write a small conversion script on your own or obtain one from Palstar (of course written as a Transform grammar).
For our running example we define the grammar symbol $number_lex as lexicon symbol for the values of the attributes ID of element TID, ID of element STID and MREF of element MAT. For the sake of demonstration we list in the lexicon all values of the attributes that appear in the document, together with an attribute which indicates the type of the value (also for the sake of demonstration). The lexicon is contained in file Ex_8\lex\ex_8_lex.txt:
$number_lex('52-43-01-000-800','t')
$number_lex('52-43-00-000-800','t')
$number_lex('52-43-01-010-001','s')
$number_lex('52-43-01-020-001','s')
$number_lex('52-43-01-110-001','s')
$number_lex('52-43-01-350-001','s')
$number_lex('05-024','m')
$number_lex('11-003','m')
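The compiled lexicon is stored as a trie. The idea can be sketched in memory as follows (the class and method names are invented for this illustration; the real lexicon compiler builds the trie on external memory to preserve the on-line property):

```python
class TrieLexicon:
    """Toy trie mapping lexicon entries to their attribute."""

    def __init__(self):
        self.root = {}

    def add(self, entry, attr):
        node = self.root
        for ch in entry:
            node = node.setdefault(ch, {})
        node[None] = attr            # end-of-entry marker carries the attribute

    def lookup(self, entry):
        node = self.root
        for ch in entry:
            if ch not in node:
                return None          # not a known $number_lex value
            node = node[ch]
        return node.get(None)

lex = TrieLexicon()
lex.add('52-43-01-000-800', 't')
lex.add('05-024', 'm')
print(lex.lookup('52-43-01-000-800'))   # -> t
```

A lookup walks the trie character by character, so a mere prefix of an entry (without the end-of-entry marker) is correctly rejected.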
- In the GUI press “Advanced mode…”.
- Press the tab “Lexicon compiler”.
- For “lexicon source” select Ex_8\lex\ex_8_lex.txt.
- For “compiled lexicon file” select Ex_8\lex\ex_8_lex.ltr.
- Note: if this is an existing compiled lexicon (with extension “.ltr”) then the entries in ex_8_lex.txt will be added. If you want to start with an empty compiled lexicon then you have to delete the existing compiled lexicon beforehand. If the compiled lexicon does not exist you will have to navigate to the correct directory (here: Ex_8\lex) and to type in the name yourself, with or without the extension “.ltr”.
- Press the button “Compile lexicon…”. The Process Monitor will show up. If the exit status is 0 then everything went well. Else you have to study the error messages.
- Close the Process Monitor.
- You are done.
The source document is Ex_8\doc\ex4.txt (which is the unmodified source document of Ex_4).
The grammar is Ex_8\gram\Ex_8.grm, which is Ex_4.grm, partly modified in order to accommodate the lexicon symbols. The changes are:
--old:
TID(O:ID)
::=
't' {ID = %} number(I:N) {ID = ID||N}
( '-' {ID = ID||%}
number(I:N) {ID = ID||N}
)*.
--new:
TID(O:ID) ::=
    't' {ID = %}
    Numbers_with_hyphens(I:st, I:ID1)
    {ID == st; ID = ID || ID1}.
--old:
RT ::=
    '(refer to TASK '
    't'
    Numbers_with_hyphens(I:rid)
    ' / page book '
    number(I:page)
    ')'
    {R:1:'<RT RID="t' || rid ||'" PAGE="' || page || '"/>'} .
--new:
RT ::=
    '(refer to TASK '
    't'
    Numbers_with_hyphens(I:st, I:rid)
    {st == 't'}
    ' / page book '
    number(I:page)
    ')'
    {R:1:'<RT RID="t' || rid ||'" PAGE="' || page || '"/>'} .
--old:
Numbers_with_hyphens(O:ID) ::=
    number(I:N) {ID=ID||N}
    ( '-' {ID=ID||%}
      number(I:N) {ID=ID||N}
    )*.
--new:
Numbers_with_hyphens(O:st, O:ID) ::=
    $number_lex(ID, st).
- In the GUI, in the Basic Mode, select as grammar Ex_8\gram\Ex_8.grm.
- Compile the grammar.
- Press the button “Advanced Mode”, press the tab Parser.
- Select as inputfile Ex_8\doc\Ex_4.txt.
- Select (at the left side of the pane) as lexicon: Ex_8\lex\ex_8_lex.ltr.
- Select (in the middle of the pane) the checkbox “lex_increment”.
- Press the button “Run grammar…”. The Process Monitor will show up. If the exit status is 0 then everything went well. Else you have to study the error messages.
- Close the Process Monitor.
- You are done. The output is in Ex_8\doc\Ex_4.xml.
In computational linguistics it is customary to study a natural language sentence on its own, stripped of markup and material that is only meant for presentation. A grammar covers one sentence. This is a linguistic approach. The handling of characters for structure and presentation is delegated to the lexical scanner.
In the exercises up till now, a grammar in Transform covered a complete document in which natural language is embedded. This is a document-oriented approach. In an XML document the natural language is contained in the PCDATA and is not further marked up.
In Transform, a number of switches have been introduced in order to accommodate either a pure document-oriented approach, a pure linguistic approach, or a combination of both. The document-oriented case we have met up till now. The linguistic case will be dealt with in this example 9. The combination of both will be met in example 10.
The input in this example (in file doc\ex9.txt) has been constructed by the former grammars. We will describe and parse the sentences which are contained in the PCDATA.
In this example the grammar covers one sentence. Blanks (spaces and tabs) between words are ignored and do not have to be described by the grammar.
The following switches are of interest. Check if they are set in the tab Parser. The GUI writes the values of the switches to gram\ex_9.sw.
single_lines=true -- each natural language sentence is expected to be on a single line, delimited by an end-of-line character. The grammar concerns one complete sentence. The parser will process all sentences in the source document.
ignore_blanks=true -- blanks (spaces and tabs) are ignored.
build_parsetree=true -- parsetrees are constructed, in file ex_9.out.
lex_increment=false -- the scanner reads a complete word before it is looked up in the lexicon.
lex_chars=a..z | A..Z | 0..9 -- these characters may appear in the lexicon.
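The combined effect of ignore_blanks and lex_chars can be sketched as a small scanner; `scan` is an invented illustration of the tokenization behaviour, not the actual lexical scanner of Transform:

```python
import string

LEX_CHARS = set(string.ascii_letters + string.digits)  # lex_chars=a..z|A..Z|0..9

def scan(line):
    # lex_increment=false: collect a complete word before any lexicon lookup;
    # ignore_blanks=true: spaces and tabs between words are skipped
    tokens, word = [], ''
    for ch in line:
        if ch in LEX_CHARS:
            word += ch
        else:
            if word:
                tokens.append(word)
                word = ''
            if ch not in ' \t':
                tokens.append(ch)   # punctuation such as the EOS '.' survives
    if word:
        tokens.append(word)
    return tokens

print(scan('On each hatch there are 42 screws.'))
```

Each word in the output would then be looked up in the lexicon as a whole, and the final '.' is left for the EOS rule of the grammar.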
Remove Hatch Seals.
Get access.
Open the hatches and remove them.
Put the hatches down on a soft mat.
Remove the hatch seals.
On each hatch there are 42 screws.
Remove the screws.
Remove the retainer strips.
Remove corrosion.
Remove any corrosion with abrasive cloth.
Clean the hatch.
Be careful when you use the methylethylketone.
If you get mek in your eyes.
Flush your eyes with water.
Get medical help.
The mek is poisonous.
Clean the areas with mek.
SENTENCE.
SENTENCE ::=
IMP_S
(ADV_SUB_F)?
EOS
|
ADV_SUB_F
(PP_S)?
EOS
|
SVO
EOS.
IMP_S ::=
$v_np()
NP(dummy)
($adv_a())?
($part())?
(PP_S)?
|
$v_copi()
$adjpre().
SVO ::=
NP(dummy)
$v_np()
$adjpre()
|
(PP_S)?
$there()
$v_copi()
NP(dummy)
|
$pers()
$v_np()
NP(dummy).
PP_S ::=
$prep_s()
($v_np())?
NP(dummy).
NP(O:num) ::=
(DETP)?
NOUN(num)
|
QUAN(num)
NOUN(num1) {num==num1}.
DETP ::=
$artdef() | $art_i() | $predet().
NOUN(O:num) ::=
$noun(dummy1,num1)
$noun(dummy2,num)
|
$dem(dummy,num)
|
($adjpre())?
$noun(dummy,num).
QUAN(O:num) ::=
$quan(dummy,num)
|
$card(dummy,num).
ADV_SUB_F ::=
$sconj()
SVO.
EOS ::= ('.' | ':' | ';').
$adjpre('abrasive')
$adjpre('careful')
$adjpre('clean')
$adjpre('medical')
$adjpre('necessary')
$adjpre('open')
$adjpre('poisonous')
$adjpre('soft')
$adjpre('your')
$adv_a('if necessary')
$artdef('the')
$art_i('a')
$card('42','p')
$dem('them','p')
$noun('access','s')
$noun('areas','p')
$noun('cloth','s')
$noun('corrosion','s')
$noun('eyes','p')
$noun('hatch','s')
$noun('hatches','p')