5. LingDoc and Parsing and Transduction Technology: Position Paper

 

1         Computational layers

For each linguistic layer there may be several computational layers. Their advantage is that they improve the overall usability of a system.

The total complexity decreases because of the principle of “divide and conquer”. Understandability increases because each layer can be studied on its own. Extensions and maintenance become easier.

o      Requirements

§       There shall be an efficient handling of ambiguities at all levels.

§       There shall be an integrated error handling throughout all levels, with correct explanations at the user level.

o      Current solutions

In current practice of document and language engineering the following computational layers may be in use:

§       A workbench for the entry and maintenance of specifications in a special formalism for a specific type of application, possibly with an automatic translation into a grammatical description.

§       A workbench for the entry and maintenance of grammatical descriptions written in a grammatical formalism.

§       An interpreter for the grammatical formalism, or a program generator (which may be called a compiler) which translates the grammatical description into code for a formal automaton. The latter approach is not mentioned by Jurafsky.

§       An interpreter for the automaton, or a compiler which translates the code for a formal automaton into a programming language or directly into computer instructions.

§       The resulting program which is an interpreter, a parser or a transducer for a document/text.

§       Principles, theory and practice of computer science can be applied for the interpreters and compilers involved.

o      Position of Transform and Clarity

The computational layers are:

§       A workbench for the entry and maintenance of grammatical descriptions written in a grammatical formalism.

§       A compiler which translates the grammatical description into code for a formal automaton

§       An interpreter for the formal automaton.

§       The resulting program which is an interpreter, a parser or a transducer for a document/text. This program may be embedded in another process, such as an author's workbench.

2         Recognizer, parser, transducer

Recognizers, parsers and transducers are programs that execute the grammatical specification for each layer. They read the input and generate output.

§       A recognizer is a process which takes as input a string and either accepts or rejects it depending on whether or not the string is a sentence that may be produced by the syntactic description.

§       A recognition process outputs, in principle, "yes" or "no".

§       A parser is a recognizer which also outputs one or more structural descriptions (“parse”) of the sentence according to the syntactic description.

§       A transducer is a recognizer or parser which emits output symbols. We distinguish two situations. The first concerns a recognition process which generates a message when some point in the grammar is reached. The second concerns a process where the input is transformed into output symbols.
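The three notions can be sketched in code. The following is a minimal illustration on the toy language a^n b^n, not the Transform or Capri implementation; all names are invented for the example:

```python
# Recognizer, parser and transducer for the toy grammar S -> a S b | (empty),
# i.e. the language a^n b^n. A sketch only, not the Transform/Capri API.

def recognize(s):
    """Recognizer: outputs only "yes" or "no" (here: True/False)."""
    half = len(s) // 2
    return s == "a" * half + "b" * half

def parse(s):
    """Parser: also outputs a structural description, as nested tuples."""
    if not recognize(s):
        return None
    tree = ("S",)                     # innermost: the empty production
    for _ in range(len(s) // 2):
        tree = ("S", "a", tree, "b")  # wrap one application of S -> a S b
    return tree

def transduce(s):
    """Transducer: emits output symbols while recognizing (here a->x, b->y)."""
    if not recognize(s):
        return None
    return s.replace("a", "x").replace("b", "y")
```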

o      Position of Transform

§       Transform can act as a recognizer, a parser or a transducer.

§       The output may contain: structural descriptions (with values of variables), reports, the variables which are associated with the start-symbol and transformed text.

o      Position of Capri

§       Capri can act as a recognizer, a parser or a transducer.

§       The output after each sentence may contain: a message, structural descriptions (with values of variables), transformed/translated sentence.

3         Off-line, on-line, real-time

Processes are embedded within an environment which can impose time-constraints. The following definitions are relevant.

A process is said to be:

§       Off-line, when a response about rejection or acceptance is given after processing the whole input. This assumes that the input is of limited length. E.g., the strategy of backtracking forces a process to be off-line.

§       On-line, when there is a response after each symbol of the input, indicating possible continuation or rejection. The input is potentially of unlimited length. In the case of a transducer the output is sometimes called “streaming output”.

§       Real-time, when the on-line response will come within a guaranteed time. The term indicates cooperation with other processes where this guarantee is imperative; usually, rejection will not occur and the input is potentially of unlimited length. In principle there is no upper bound for the guaranteed time. In a more colloquial definition of real-time the delay is negligible for human experience.
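The on-line property can be illustrated with a small sketch (an assumed deterministic automaton for the toy language a*b*, not part of Transform): after every input symbol the process reports whether continuation is still possible, without waiting for the end of the input.

```python
# On-line recognition of a*b*: a verdict is produced after each symbol.

def online_recognize(symbols):
    """Yield (symbol, verdict) pairs; the verdict after each symbol is
    'continue' (the input may still grow into a sentence) or 'reject'."""
    state = "A"                        # 'A': reading a's; 'B': reading b's
    for sym in symbols:
        if state == "A" and sym == "a":
            pass                       # stay in state A
        elif state in ("A", "B") and sym == "b":
            state = "B"                # switch to, or stay in, state B
        else:                          # an 'a' after a 'b', or a foreign symbol
            state = "DEAD"
        yield sym, "reject" if state == "DEAD" else "continue"
```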

o      Requirements

§       Embedding in continuous processes requires the on-line property.

§       Cooperating processes will need a signal when a specific position in the input is reached.

§       If a partial semantic representation has to be built during on-line processing, then evaluated expressions with variables have to be emitted after each input symbol.

o      Position of Transform

§       Transform has the on-line (streaming) property. It emits output when the process is in a deterministic state.

o      Position of Capri

§       Capri has the off-line property for a complete sentence. It emits output when a sentence is completed.

4         Indeterminism

During recognition, parsing and transduction indeterminism may arise.

With on-line recognition of ambiguous grammars there are three possibilities for proceeding after a new symbol is read:

§       no proceeding possible (and an error-recovery scheme may have to be invoked)

§       there is exactly one possibility of proceeding (then the process is said to be deterministic)

§       there are two or more possible ways of proceeding (the non-deterministic case).
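A breadth-first step over a set of live analysis paths makes the three cases concrete. This is a generic sketch (the relation `delta` is an assumed example, not the Transform automaton):

```python
# One breadth-first step: advance all live states over the next symbol
# and classify the result as error, deterministic or non-deterministic.

def step(delta, live_states, symbol):
    """delta maps (state, symbol) to a set of successor states."""
    successors = set()
    for q in live_states:
        successors |= delta.get((q, symbol), set())
    if not successors:
        kind = "error"               # no proceeding possible
    elif len(successors) == 1:
        kind = "deterministic"       # exactly one possibility of proceeding
    else:
        kind = "non-deterministic"   # two or more possible ways of proceeding
    return successors, kind
```

Note that two paths may lead to the same successor state; keeping the successors as a set is precisely the recombination of paths that makes breadth-first processing tractable.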

o      Requirements

§       In linguistic processes all forms of ambiguity across all linguistic layers must be handled.

§       Subsequently, in up-conversion processes and in author workbenches multiple choices have to be presented to the user.

o      Current solutions

§       Temporal ambiguities may be resolved by lookahead.

§       Inherent ambiguities may be resolved by

-    Explicit backtracking ("depth first"), provided in some interpreters of formalisms and in some programming languages (like Icon, Prolog, etc.).

-    Parallel parsing, where all possible paths are tried in parallel ("breadth first") and common sub-paths are shared. This is attractive for the parsing of context-free grammars because of the possibility of cubic processing time, e.g. in a parse forest with Tomita-style parsing (GLR), or in the (interpretative) algorithm of Earley.
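As a concrete instance of the breadth-first strategy, here is a compact sketch of Earley's recognition algorithm for plain context-free grammars (recognition only, without the extensions discussed in this paper; the dict-based grammar representation is an assumption for the example):

```python
# Earley recognition: chart[i] holds items (lhs, rhs, dot, origin),
# meaning lhs -> rhs with the dot after `dot` symbols, started at `origin`.

def earley_recognize(grammar, start, tokens):
    """grammar: dict from nonterminal to a list of right-hand sides (tuples)."""
    chart = [set() for _ in range(len(tokens) + 1)]
    for rhs in grammar[start]:
        chart[0].add((start, rhs, 0, 0))
    for i in range(len(tokens) + 1):
        agenda = list(chart[i])
        while agenda:
            lhs, rhs, dot, origin = agenda.pop()
            if dot < len(rhs):
                sym = rhs[dot]
                if sym in grammar:                            # predictor
                    for alt in grammar[sym]:
                        new = (sym, alt, 0, i)
                        if new not in chart[i]:
                            chart[i].add(new)
                            agenda.append(new)
                elif i < len(tokens) and tokens[i] == sym:    # scanner
                    chart[i + 1].add((lhs, rhs, dot + 1, origin))
            else:                                             # completer
                for l2, r2, d2, o2 in list(chart[origin]):
                    if d2 < len(r2) and r2[d2] == lhs:
                        new = (l2, r2, d2 + 1, o2)
                        if new not in chart[i]:
                            chart[i].add(new)
                            agenda.append(new)
    return any(it == (start, rhs, len(rhs), 0)
               for rhs in grammar[start] for it in chart[len(tokens)])
```

The chart shares common sub-analyses between competing paths, which is what keeps the ambiguous grammar `S -> S + S | n` polynomial instead of exponential.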

o      Problems with current solutions

§       Temporal ambiguities are often introduced by inexperienced users who try to write an initial grammar. In that case a restriction to non-ambiguous grammars is frustrating.

§       Inherent ambiguities mostly have semantic reasons. They may be minimized by maximizing the use of syntactic clues in the input, making use of lookahead. A grammar formalism together with the indefinite lookahead property provides for a context of indefinite size, to the left and to the right. This is in contrast with event-based formalisms.

§       In many implementations of grammatical systems the handling of ambiguity is poor. Most implementations work with chronological backtracking techniques, which operate off-line. This can also lead to exponential runtime behavior.

o      Position of Transform

§       In the design of LingDoc Transform, special care has been taken to handle ambiguity in an efficient way.

§       Parallel parsing uses recombination of paths; this accounts for the on-line property. There is no backtracking.

§       Also, attributes and expressions are evaluated on-line.

-    There is error repair when no proceeding is possible.

-    If one analysis path remains, the output, which is stored in the remaining parse-forest, is streamed.

o      Position of Capri

§       For each sentence there is parallel parsing according to the Tomita algorithm. After the analysis is finished attributes are unified, expressions are evaluated and output is emitted.

5         Interpreters and compilers

o      Interpreters

§       Description of issue

An interpreter for a grammatical formalism calculates after each input symbol what will be the next state of the process.

During the development of a grammar the use of an interpreter may be advantageous because the grammar need not be compiled.

§       Requirements

For efficiency the formalism may be pre-processed.

§       Current solutions

A well-known example for the parsing of context-free languages is the dynamic programming algorithm of Earley. This algorithm has been adapted to a number of extensions of context-free grammars (see also Jurafsky).

§       Position of Transform

As an option, Transform may perform parsing for ecfg’s with Earley’s algorithm. The pre-processing is more or less the same as for LR-parser generation. The idea is that for testing grammars an interpreter may be faster than compilation with subsequent parsing.

Accordingly, as additions were made to the ecfg formalism of Transform over time, Earley's algorithm was extended as well. However, the implementation has lagged behind. Chapter 6 of the thesis describes which pre-processing can be performed for the full formalism.

o      Compilers

§       Description of issue

In general: a compiler transforms a specification written in a formalism into an executable program. This may be machine code or intermediate code which will be interpreted.

After the development of a grammar the use of a compiler may be advantageous because the run time will improve drastically.

A more or less equivalent term for a compiler is a program generator; if the program is a parser then the compiler may be called a parser generator.

A parser generator takes a formal description of a grammar (e.g. in BNF) and outputs code for a parser. The parser will recognize valid strings according to the grammar and will perform actions associated with grammar rules. In generating the code, the compiler describes every possible so-called state of the parser during the treatment of the input. Such a state depends only on the section of the input already treated.

§       Position of Transform and Capri

Both have a compiler.

See further the separate section on program generation.

6         Automata

In Computer Science a correspondence is maintained between grammatical formalisms and formal automata.

A formal automaton consists of a number of states and of a set of data structures. The states are pre-calculated by a parser generator. The data structures are updated during the processing of the input. Each state is associated with a unique situation. Upon reading a symbol a transition is made from one state to another, together with a possible update of the data structures and the emission of output.
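The scheme can be sketched as a table-driven automaton in the style of a Mealy machine; the table below stands in for the states pre-calculated by a generator (the example, which tracks the parity of 1-bits, is purely illustrative):

```python
# A pre-calculated transition table: (state, symbol) -> (next state, output).
TABLE = {
    ("even", "1"): ("odd",  "flip"),
    ("odd",  "1"): ("even", "flip"),
    ("even", "0"): ("even", "keep"),
    ("odd",  "0"): ("odd",  "keep"),
}

def run(inp, state="even"):
    """Interpret the automaton: on each symbol, make a transition and
    emit an output symbol; reject when no transition is defined."""
    emitted = []
    for sym in inp:
        if (state, sym) not in TABLE:
            return None, emitted       # no transition defined: reject
        state, out = TABLE[(state, sym)]
        emitted.append(out)            # emission of output on the transition
    return state, emitted
```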

The complexity of time and space of recognition, parsing and transduction can be studied more easily with the aid of automata.

The automata which correspond to ecfg’s operate on-line.

o      Requirements

§       Automata shall be able to handle non-determinism.

§       Automata shall run with the lowest complexity possible.

o      Current solutions

§       Jurafsky et al. discuss a number of solutions. However, they do not discuss solutions for non-deterministic breadth-first parsing where a parse forest is maintained, as in the algorithms of Tomita (GLR).

o      Position of  Transform

§       The development of Transform started with a solution comparable with Tomita’s solution. (This development occurred independently.) The extensions of the formalism are reflected in extensions to the formal automaton. The result was the so-called Parallel Transduction Automaton (PTA), described in the thesis.

§       Upon reading the next symbol a number of instructions are executed which operate on two dag-structured stacks and on the parse forest. The instructions are calculated by the compiler.

§       The developed run time system is an interpreter for the PTA. The run time system tries to follow all possible paths in parallel (breadth first). If there is more than one path, common parts of the resulting parse trees will be shared as much as possible in a “parse forest”.

§       There is a “deep binding” of variables and output in the parse forest.

§       In the case of a transduction grammar pieces of a text will be transformed. If no more rules are applicable the transformed text is emitted.

§       Multiple grammars may be cascaded. Output from one PTA is then submitted to another PTA, as soon as output becomes available.

o      Position of Capri

§       Capri follows the solution of Tomita. After parsing a sentence all resulting parse trees are reconstructed. For each parse tree the variables and expressions are evaluated as if there were only one parse tree.

7         Lexical scanner, lexicon and parser

Commonly, the lexical scanner, the lexicon and the parser operate in different linguistic layers.

o      Requirements

§       The lexicon shall provide for

-    Multiwords

-    Idiomatic expressions.

§       The lexical scanner has to identify tokens for the parser/transducer, markup of the document and strings of separators. In the output of a transducer the markup and strings of separators have to be re-inserted in the right place.

o      Problems with current solutions, according to the usability criteria

Conceptually, the lexicon is part of a syntactic description, but in practice it plays a separate role. Most implementations make use of an external data structure for which it is necessary to know the whole entry before it can be looked up. This may cause trouble with multiwords and with words that appear in the grammar itself.

o      Position of Transform

§       The lexical scanner handles symbols on the character- and on the word-level.

§       Words (in general strings) may be described in the lexicon as well as in the grammar.

§       The lexicon is essentially treated as an extension of the grammar so that ambiguities on the lexical level are dealt with in parallel with ambiguities on the syntactic level.

§       The access to the lexicon runs in parallel with the parsing process and is transparent for ambiguities, multiwords and words which appear in the grammar.

§       The lexicon is implemented as an external “trie” data structure. Because of this, two valuable properties result:

-    the on-line property is maintained at the character level

-    the access-time of the lexicon is independent of the size of the lexicon.
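A trie lexicon with these two properties can be sketched as follows (an illustration of the data structure, not the actual Transform implementation):

```python
# A trie lexicon: lookup advances one node per input character, so the
# on-line property holds at the character level, and access time depends
# on the length of the entry, not on the size of the lexicon.

class TrieNode:
    def __init__(self):
        self.children = {}      # char -> TrieNode
        self.entry = None       # lexical information for a complete entry

def insert(root, word, info):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.entry = info

def scan(root, text, start):
    """Advance through the trie from position `start`; return the longest
    matching entry as (end_position, info), or None."""
    node, best = root, None
    for i in range(start, len(text)):
        node = node.children.get(text[i])
        if node is None:
            break
        if node.entry is not None:
            best = (i + 1, node.entry)   # longest match so far
    return best
```

Because `scan` consumes the input character by character, a multiword entry such as "new york" is matched incrementally, in the same pass that would otherwise deliver the shorter word "new".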

o      Position of Capri

§       In Capri a word is the smallest lexical unit. It has to appear in the lexicon.

§       The lexicon resides in a database. It can handle ambiguities and multiwords. The handling of ambiguities in the lexicon and in the grammar is transparent.

§       The lexical scanner distinguishes words and strings of separators. A string of separators may contain XML markup. During translation the XML markup is kept with the lexical unit with which it is associated.

§       Rules for sandhi in the translated text have, up to now, been programmed for each target language.

8         Robustness

A typical aspect of practical Document and Language Engineering is integrated error handling across all linguistic layers.

Errors arise from ill-formed input and from wrong or inconsistent grammatical specifications.

o      Requirements

§       Error handling shall be robust.

§       Errors have to be detected as soon as possible.

§       Errors have to be "repaired", as correctly as possible, in order to continue the process.

§       All errors have to be reported, with an exact positioning and with suggestions for correction.

o      Position of Transform

§       Errors are repaired as long as possible and as correctly as possible, in order to continue the process.

§       The grammar writer may define a set of “recovery symbols”. If an error is encountered the process continues along all paths that are available until a recovery symbol is met. There is a strategy to determine which input symbols have to be skipped and which recovery paths in the dag-structured stack and parse forest have to be pruned.
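In a simplified, panic-mode style, the recovery-symbol idea looks as follows (the recovery set and the `accept` callback are hypothetical stand-ins; Transform's real strategy additionally prunes recovery paths in the dag-structured stack and parse forest):

```python
# Panic-mode sketch of recovery symbols: on an error, input symbols are
# skipped until a recovery symbol is met, and processing resumes there.

RECOVERY = {";", "."}    # hypothetical recovery symbols chosen by the grammar writer

def parse_with_recovery(tokens, accept):
    """`accept(tok)` says whether the process can proceed on `tok`.
    Returns (consumed, errors): the tokens processed and the skipped
    regions reported as (start, end) index pairs."""
    consumed, errors, i = [], [], 0
    while i < len(tokens):
        if accept(tokens[i]):
            consumed.append(tokens[i])
            i += 1
        else:
            start = i
            while i < len(tokens) and tokens[i] not in RECOVERY:
                i += 1                  # skip up to a recovery symbol
            errors.append((start, i))   # report the skipped region exactly
    return consumed, errors
```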

o      Position of Capri

§       Capri works with grammar rules in which grammatical errors are specified, together with their corrections.

9         Workbenches

Linguists should be provided with a workbench for the development of lingware.

o      Requirements

§       There have to be facilities for tracing and debugging of grammars.

o      Position of Transform and Capri

§       Detailed tracing of the parsing/transduction process is possible. However, the LR parsing process works bottom-up and is therefore difficult to follow.

§       More visual tools for the grammar writer are needed.

10         Embedding

A process for recognition, parsing or transduction may be embedded within other processes.

o      Requirements

§       The embedding has to follow standard procedures for cooperating processes.

o      Position of Transform

§       There are no specific provisions for embedding. However, Transform is programmed as a multitasking system.

o      Position of Clarity: Capri, Lexbench, Author

§       Capri is the engine which performs the analysis, correction and translation; it runs in the background of Author or as a standalone program.

§       Author is the generic name for the interface between Capri and an external authoring system.

§       Author assists an author during normal editing tasks with correction and standardization of spelling, terminology, grammar and style. If errors are encountered, suggestions for corrections are given to the author.

§       Author has been implemented in three versions:

-    As an interface to LEditor, a standalone editor, developed internally by Lingware services. This was the first version of Author with a tight integration of editing functions, functions of Capri, functions for inspecting and updating the lexicon and functions for the workflow of authoring documents in controlled language. LEditor has been productive in a number of projects. The experience with it suggested a more formal approach for the development of an independent version of Author with an own identity.

-    FmEdit, an add-in for Framemaker, was the second version of Author.

-    WoEdit, an add-in for Microsoft Word, was the third version of Author.

-    A still more generalized version of Author has been envisaged but has not yet been implemented.

11         Compiler/program generator/parser generator

The theory of parser generation stems from computer science. LL (top-down) and LR (bottom-up) parsing have drawn a lot of attention because they cover most types of context-free grammars, while indeterminism of the parser may be resolved by lookahead. Parser generation involves a direct mapping of a grammar onto a stack automaton.

Variants of LR parsing are LALR and SLR. LALR(1) means: LALR parsing with one lookahead symbol. SLR(k) means: SLR parsing with k lookahead symbols.

o      Requirements

§       The extensions to ecfg’s according to Position Paper 4 shall also be handled, including ambiguous grammars.

§       The grammatical formalism shall be translated directly into code for a formal automaton.

§       Errors shall be handled.

§       The theoretical lower bounds for time and space shall be approached.

o      Current solutions

§       Examples of open source parser generators are

-    Yacc (LALR(1))

-    Cup (LALR(k))

-    Antlr (LL(k))

-    Precc (LL(k)), where the formalism is EBNF with attributes.

o      Problems with current solutions

§       ambiguity is not allowed

§       there are no extensions to cfg's

§       debugging is difficult.

o      Position of Transform

§       In choosing the parser strategy and the number of lookahead symbols, the following considerations were made:

-    The LR method accommodates more types of grammar than the LL method.

-    The larger the number of lookahead symbols, the more time is needed for parser generation. Therefore, if ambiguous grammars are accommodated, there is less need for a large number of lookahead symbols. SLR has been chosen instead of LALR because it generates fewer states than LALR, and consequently less code is generated.

§       Extensions to LR parser theory have been made for the formalism of Position Paper 4.

§       Optimizations are built in for code generation for the lowest type of formal automaton possible. E.g.: grammars without middle recursion are mapped onto a finite state automaton. Empty or unit rules are eliminated.
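As an illustration of one of these optimizations, here is a sketch of unit-rule elimination on a dict-based grammar (the representation is an assumption for the example, not the Transform internals):

```python
# Eliminate unit rules A -> B (a single nonterminal on the right-hand
# side) by replacing them with the alternatives of B, to a fixed point.

def eliminate_unit_rules(grammar):
    """grammar: dict from nonterminal to an iterable of right-hand sides
    (tuples of symbols). Returns a dict from nonterminal to a set of rhs."""
    g = {a: set(map(tuple, rs)) for a, rs in grammar.items()}
    changed = True
    while changed:
        changed = False
        for a in g:
            for rhs in list(g[a]):
                if len(rhs) == 1 and rhs[0] in g:   # unit rule a -> rhs[0]
                    g[a].remove(rhs)
                    g[a] |= g[rhs[0]] - {(a,)}      # inline, avoiding a -> a
                    changed = True
    return g
```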

o      Position of Capri

§       an ECFG is first translated into a CFG

§       variables are handled separately from the CFG

§       SLR(1) parser generation is used

§       parsing is done by a stack automaton with a parse forest (Tomita style).

12         Complexity of time and space

Typical stages of the evolution of algorithms and programs are

§       the resolution of undecidability problems

§       the development of naive algorithms (usually exponential in running time or NP-complete)

§       the improvement in speed to polynomial or linear time algorithms

§       the improvement of the constant factors

§       the optimizations within the program, not dealing with the algorithm, and

§       the adaptations in the hardware of the machine (e.g. development of specialized micro code).

o      Requirements

§       Complexity of time and space has to be as small as possible; the theoretical lower bounds for time and space shall be approached.

o      Current solutions

§       There is ongoing research on the complexity of algorithms; in general the strategy is to find sub-classes of classes of problems for which the complexity can be improved. For the complexity of time this may concern:

-    From NP to P

-    From P to linear time

-    From linear to sub-linear time.

o      Position of Transform

§       In Transform a number of improvements are implemented. These are described in the thesis. There it is shown

-    how to parse on-line, in exponential time, ambiguous type-0 grammars and transduction grammars, enriched with variables and with an on-line construction of the parses, with the aid of the PTA

-    how to parse on-line in polynomial time, with an on-line construction of the parse trees, extended context-free grammars with don't cares

-    how to reformulate the problem of pattern matching of Aho and Corasick into the problem of LR-parsing, with the same linear time behavior

-    how to extend the LR parser generation method for the treatment of grammars with Boolean expressions for the treatment of term-rewriting systems in such a way that pattern matching with a number of keywords, containing regular expressions and don't cares, can be implemented to run in sub-linear time

-    how to generate in all these cases the minimum of code for the PTA in order to use only those parts which are necessary

-    how to implement the PTA on parallel hardware.

o      Position of Clarity: Capri, Lexbench, Author

§       No special improvements of complexity have been implemented. In the runtime system the algorithm of Tomita is implemented.

13         To be done

o      Transform: a number of sub-formalisms have been temporarily disabled. They have to be activated again. They concern:

§       Interpretation according to Earley.

§       “Cooperations” between (pattern-)grammar rules (not mentioned before; this concerns the synchronization of patterns, as an extension to Boolean “and”).

§       Tree pattern matching (its maintenance was postponed because Xquery became available).

§       Booleans of nonterminals.

o      Transform and Capri: probabilities of grammar rules.

o      Transform: full unification by on-line lazy evaluation.

o      Capri: from explicit to implicit unification (a matter of pre-processing).

o      Transform and Capri: easier facilities for debugging grammars.

14         Literature

o      Van der Steen, G.J., A program generator for recognition, parsing and transduction with syntactic patterns. CWI Tract 55, Centre for Mathematics and Computer Science, P.O. Box 4079, 1009 AB  Amsterdam, The Netherlands, 284 pp, 1988, ISBN 9061963613.

o      Van der Steen, G.J., “Unrestricted on-line parsing and transduction with graph structured stacks”. In: Lankhorst, M. et al. (eds.), “Tomita’s Algorithm - Extensions and Applications”, Twente Workshop on Language Technology 1 (TWLT1), University of Twente, 1991, pp. 36-78.

o      User Manual Transform.

o      Reference Manual Transform.