3.
LingDoc and Lingware Management for Controlled Languages and MT: Position Paper
In this position paper we describe the use
of LingDoc for the handling of controlled languages and for their subsequent
translation.
1
Problems with Language Engineering
In the Position Paper “LingDoc and
Applications of Language and Document Engineering” the related disciplines of
Language Engineering were defined. Each of these disciplines is in a stage of
development.
For Linguistics, linguistic theories are in constant
movement and formalisms keep changing. A formal theory of
translation is needed.
Computational Linguistics makes linguistic
theory computational, but where the theory is insufficient the solutions tend
to become ad hoc. There is a need to formalize the meaning of words,
sentences and discourse. Accurate translation requires an understanding of the
context, as well as an understanding of the structure and rules of the
language.
In Natural Language Processing (NLP) there
is a problem with the complexity of unrestricted language. Vocabularies are
vast, with many ambiguities. There is no natural language for which a complete grammar
exists.
The worst-case scenario of NLP is found in
systems for Machine Translation (MT). Most practical MT systems require human
pre- and/or post-editing, which considerably reduces their
cost-effectiveness. Moreover, input texts are frequently ill-formed. Therefore,
unrestricted MT is not considered feasible with a grammar-based approach.
The alternatives are either to follow a
statistical approach with machine-learned rules or to use a restricted
language. The first approach is still in a stage of development. We concentrate
on the second one.
2
The rationale of Controlled Language
For Language Engineering the use of
restricted, or controlled, language is a viable approach. The terms
“restricted” and “controlled” are used more or less interchangeably.
An
example is the correction of the following sentence:
In general, the restriction may concern
readability and/or translatability. This corresponds with the distinction
between monolingual and multilingual applications.
The restriction may concern only the lexicon. An example is “AECMA
Simplified English”, which the aviation industry has decided to introduce as the
standard for its maintenance manuals. On grammar and style, authors are given
only advice. On the other hand, the restriction may also concern
the formal grammar. For instance, one of the first fully automatic MT systems
that worked correctly was Meteo, a system for weather forecasts in Canada. Its language
was heavily restricted, which the restricted domain made possible.
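A restriction that concerns only the lexicon can be illustrated with a minimal sketch. The word list and the function name are assumptions made for illustration; they are not taken from AECMA Simplified English or from LingDoc.

```python
import re

# Hypothetical approved vocabulary; a real controlled language
# would use a much larger, curated word list.
APPROVED_WORDS = {"remove", "the", "panel", "and", "install", "new", "filter"}


def unapproved_words(sentence, approved=APPROVED_WORDS):
    """Return the words of `sentence` that are not in the approved vocabulary.

    Words are compared in lower case and returned in order of appearance.
    """
    words = re.findall(r"[A-Za-z]+", sentence.lower())
    return [word for word in words if word not in approved]


if __name__ == "__main__":
    print(unapproved_words("Remove the panel and install the new filter."))
    print(unapproved_words("Detach the panel."))
```

A check of this kind covers the vocabulary only; it says nothing about grammar or style, which is where a formal grammar comes in.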
In short, the advantages of controlled
languages, compared with unrestricted natural language, are:
§
The vocabulary becomes smaller.
§
The size of the grammar is
reduced.
§
Ambiguities are eliminated.
§
Interpretation becomes unequivocal:
texts become more readable, which makes them accessible to a
larger audience.
§
Authors make fewer errors; texts
are more precise.
§
Efficiency is increased when
texts are processed.
§
In the case of MT: custom
systems may be built which can reduce human post-editing or even produce
correct translations.
3
Requirements for Controlled Language
The following requirements stem from our
own experience and from conferences on controlled languages.
o
Working methods
§
Systems for controlled
languages shall be productive and effective. Complexity shall be conquered by
subdividing problems (as a general principle). For different user needs
different customized products shall be developed.
§
Systems shall be reliable,
especially when they report errors and inconsistencies.
§
The grammars and lexicons (the
“lingware”) shall be adjustable.
§
Upscaling of existing systems
for more general applicability shall be done gradually and carefully,
monitoring correctness, complexity and cost-effectiveness in small steps.
o
Lingware
§
In the grammar formalism there
shall be facilities for rules expressing
-
Readability
-
Translatability
-
Standardization for non-conforming sentences
-
Error reporting and correction
-
Context-sensitivity for XML markup.
§
The understandability of the
lingware shall be maximized. Contributing factors are:
-
The modularity of the grammar
-
The number of grammar rules
-
The interaction between rules
-
The shallowness of parses
-
The discipline of linguists to keep the grammar
unambiguous. Note: Linguists coming straight from universities tend to write
wide-coverage grammars.
-
The consistency for authors.
§
The reliability of the lingware
may depend on the following factors:
-
In the course of time there will be
intra-personal discrepancies.
-
If there is collaborative work on one grammar
there will be danger of inter-personal discrepancies.
o
Training courses for authors
§
Many authors do not like to use
a controlled language, because they feel (or fear) that it restrains them.
Authors shall be assisted as much as possible with technical support.
§
Re-train technical writers on
basic grammar before teaching them complex CL rules (citation from Jeff Allen,
2004a)
§
Users shall be trained
progressively on CL grammar rules. Do not impose many (especially complex) CL
rules to be learned and mastered immediately and perfectly (Allen)
§
Terminology and teaching
methodology from (computational) linguistics shall be minimized (Allen)
§
Do not conduct pilots with
users having limited computer skills and/or expectations of computers to locate
and correct their errors. (Allen)
§
Avoid cross-cultural problems
o
Workbenches for authors
§
Correct and incorrect sentences
shall be marked, e.g. by the colors green and red
§
Reliable error messages shall
be presented
§
Correct alternatives shall be
presented
§
Authors shall be allowed to
postpone a correction
§
The workbench shall provide
facilities for workflow between authors and their managers
§
There shall be a logging
facility for decisions taken by authors.
o
Workbenches for grammar writing
§
The workbench shall provide for
collaboration in workgroups of linguists
§
There have to be facilities for
tracing and debugging
§
There have to be facilities for
test management.
o
Workbenches for lexicons
§
It shall be possible to reuse
existing Machine Readable Dictionaries
§
It shall be possible to access
a thesaurus with preferred terms
§
If a token is not in the
lexicon the author shall be allowed to update the lexicon; the update has to be
approved by the lexicon manager
§
There shall be assistance with
the assignment of lexical features.
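The requirement that an author may propose a lexicon update which the lexicon manager then has to approve can be sketched as follows. The class, its method names and the feature dictionary are hypothetical and serve only to illustrate the workflow; they do not describe an existing workbench.

```python
from dataclasses import dataclass, field


@dataclass
class Lexicon:
    """Lexicon with a queue of author proposals awaiting approval."""

    entries: dict = field(default_factory=dict)  # approved word -> features
    pending: dict = field(default_factory=dict)  # proposed word -> (features, author)

    def propose(self, word, features, author):
        """Queue a new word; return False if the word is already approved."""
        if word in self.entries:
            return False
        self.pending[word] = (features, author)
        return True

    def approve(self, word):
        """Lexicon manager accepts a pending proposal."""
        features, _author = self.pending.pop(word)
        self.entries[word] = features

    def reject(self, word):
        """Lexicon manager discards a pending proposal, if there is one."""
        self.pending.pop(word, None)


if __name__ == "__main__":
    lexicon = Lexicon()
    lexicon.propose("flange", {"cat": "noun", "number": "sg"}, author="author1")
    lexicon.approve("flange")
    print(lexicon.entries)
```

Keeping proposals in a separate queue means that an unapproved word never influences parsing for other authors.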
4
Current solutions for Controlled Language
Controlled languages are discussed at, for
instance, the Conferences of CLAW and LISA.
In “Can Controlled Languages Scale to the Web?”, at CLAW 2006, Jonathan Pool
concludes: “Controlled natural languages that have been reported as successes
have been mainly restrictive: designed for limited, intra-organization or
intra-industry purposes. That they cover single domains and genres, with
repetitive and trainable authors, facilitates their efficacy.” In that paper he
discusses a number of controlled languages.
In “Introduction to Controlled Languages” (2004), Jeff Allen presents
the following list of controlled languages:
§
Monolingual CLs
-
Basic English (1930s)
-
Caterpillar Fundamental English (CFE) 1970s
-
International Language of Service and
Maintenance (ILSAM)
-
Bull Global English
-
Perkins/Univ Edinburgh PACE
-
AECMA Simplified English (SE)
-
GIFAS Rationalized French
-
Kodak International Service Language Smart
Controlled English
-
General Motors Global English
-
Securities and Exchange Commission (SEC) Plain
English
-
Fight the Fog (European Commission)
-
MultiDoc project Controlled Languages
-
Remedios Ruiz/Richard Sutcliffe Controlled
Spanish
§
Multilingual CLs
-
Caterpillar Technical English (CTE)
-
Attempto Controlled English (ACE)
-
Alcatel COGRAM
-
Xerox Multilingual Customized English
-
Kodak International Service Language
-
General Motors Controlled Automotive Service
Language (CASL)
-
IBM Easy English
-
ProLingua LinguaNet
-
Diebold Controlled English
-
Scania Swedish
-
Smart Controlled English
-
Nortel Standard English (NSE)
-
OCÉ Controlled English
-
Sun Controlled English
-
Avaya Controlled English
-
Oracle ORACAL
-
Allen's Controlled English for DIPLOMAT project
At conferences, few technical details are
reported about the formalisms of grammars and lexicons, about workbenches and
about training courses.
5
Position of LingDoc Transform
Transform has no special provisions for
Controlled Languages.
In general, Transform has a mechanism for
rewriting with transduction grammars.
Transform has been used by the software
house BSO (now part of ATOS Origin) for the analysis stage of a large system
for MT (DLT), building dependency trees which were subsequently transformed.
1
Position of LingDoc Clarity: tools Capri, Lexbench and Author
o
Lingware
The grammar formalism of LingDoc Clarity
Capri is a unification of a number of useful formalisms (see Position Paper 4).
The formalism is rich enough for rules expressing readability, translatability,
standardization and error reporting and correction.
Grammar rules may perform syntax-directed
translation. The values of lexical features can be manipulated, e.g. for the
correction of agreement.
Precise results are obtained by writing
shallow grammars with many rules.
MT is treated as a correction process.
The lexical scanner handles the separation
between lexical tokens, XML mark-up and other separators.
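The separation of lexical tokens, XML mark-up and other separators can be illustrated with a small regular-expression scanner. This is a sketch under simplifying assumptions and not the Capri scanner: the pattern names and the `scan` function are invented here, and XML comments, entities and CDATA sections are not handled.

```python
import re

# One alternative per token class; the group name becomes the token type.
TOKEN_PATTERN = re.compile(
    r"(?P<markup><[^<>]+>)"        # an XML start, end or empty-element tag
    r"|(?P<word>\w+)"              # a lexical token
    r"|(?P<separator>[^\w\s<>])"   # punctuation and other separators
)


def scan(text):
    """Return a list of (token_type, token_text) pairs.

    Whitespace and stray angle brackets match no alternative and are skipped.
    """
    return [(match.lastgroup, match.group()) for match in TOKEN_PATTERN.finditer(text)]


if __name__ == "__main__":
    for token in scan("<step>Remove the panel.</step>"):
        print(token)
```

Keeping mark-up as a token class of its own allows later grammar rules to refer to the enclosing element when a sentence is analyzed.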
o
Capri
Capri stands for “Cap Gemini rules interpreter”. It is described in more
detail in the position paper on Grammar Formalisms.
Capri acts as the engine for Author; it can also run independently as an
analyzer or a translator.
o
Author
Author corrects and standardizes language
according to the lingware for the Controlled Language.
Correct and incorrect sentences are marked
(by the colors green and red).
Error messages are presented, together
with alternatives.
There are facilities for workflow between
authors and a manager who oversees the production process and who is an
intermediary to the linguists who maintain the lingware.
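The marking of sentences as green or red, together with an error message and a proposed alternative, can be illustrated for one agreement rule. The sketch is written in Python and does not use the Capri formalism; the tiny lexicon, the feature names and the function `check_agreement` are assumptions made for this example only.

```python
# Hypothetical lexicon: each word carries a category and a number feature.
LEXICON = {
    "this": {"cat": "det", "number": "sg"},
    "these": {"cat": "det", "number": "pl"},
    "valve": {"cat": "noun", "number": "sg"},
    "valves": {"cat": "noun", "number": "pl"},
}

# Determiner form to use for each value of the number feature.
DETERMINER_FOR = {"sg": "this", "pl": "these"}


def check_agreement(words):
    """Check determiner-noun agreement in a list of lower-case words.

    Returns (status, messages, alternative): status is "green" or "red",
    messages lists the errors found, and alternative is the word list with
    each disagreeing determiner replaced by the form that matches its noun.
    """
    alternative = list(words)
    messages = []
    for i in range(len(words) - 1):
        det = LEXICON.get(words[i])
        noun = LEXICON.get(words[i + 1])
        if det and noun and det["cat"] == "det" and noun["cat"] == "noun":
            if det["number"] != noun["number"]:
                alternative[i] = DETERMINER_FOR[noun["number"]]
                messages.append(f"'{words[i]}' does not agree with '{words[i + 1]}'")
    status = "red" if messages else "green"
    return status, messages, alternative


if __name__ == "__main__":
    print(check_agreement(["close", "these", "valve"]))
```

In this sketch the noun decides the number and the determiner is adapted; a real rule set would also need a policy for cases where the noun is the word in error.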
o
Lexbench (Workbench for
lexicons)
Lexbench supports multiple projects and multiple
authors.
It is possible to use multiword units and
idiomatic expressions.
It is possible to access existing Machine
Readable Dictionaries for getting proposals.
It is possible to access a thesaurus with
preferred terms.
If a token is not in the lexicon the
author may update the lexicon; the update is logged.
There is assistance with the assignment of
lexical features.
o
Workbenches for grammar writing
So far, facilities are sparse.
There is a provision to work with modules of sub-grammars, comparable to include files.
Some ideas for test management have been implemented; others are still pending.
o
Training courses for authors
Training courses for authors, linguists
and Lexbench administrators have been developed.
2
Position of Clarity: Working method for the
development of controlled language
Cap Gemini Lingware Services developed a
working method for the development and adaptation of lingware in a modular and
controllable fashion. This working method was continually refined and included
a cost estimation for potential customers.
The working method is phased like any ICT
project, with typical phases like feasibility study, definition study,
information analysis, functional and technical design, implementation and
deployment. Some activities are as follows.
o
Feasibility study
The language usage within a certain area
of application in the customer organization (the “domain”) is investigated.
Based on the results advice is given on the applicability of restricted
language usage in that particular domain. Also, an estimation is made of the
costs that the introduction of lingware would involve. This is not a trivial
task. A checklist with rules of thumb for cost estimation has been continuously
updated.
o
Customization of Grammars and
Lexicons
§
The lexicons and grammars are
customized per application for the required source and target languages.
§
There is a parallel corpus with
representative test-sentences.
§
Linguistic classifications are
made of sentences and tokens; selections are made on the basis of possible
coverage, ambiguity and relevance for the client.
§
Some types of sentences have to
be rejected or transformed.
§
Grammar rules are developed
gradually.
§
The relation between test
sentences and grammar rules is maintained in order to facilitate later
updates.
§
Rules for correction,
translatability and readability are added.
§
There are many rules for
readability derived from international writing and technical communication
guidelines that can be used and adapted. Similarly, rules for correction of
writing errors can be derived from typologies of user mistakes.
§
If applicable, the document
schema has to be taken into account, as a possible source for disambiguation.
§
The relation is maintained
between the number of grammar rules, the number of lexical entries and the
coverage of the test corpus. This information has to be gathered incrementally
so that insight can be gained into the extent to which prototypes scale up to
production versions.
o
The software may have to be
integrated in other systems.
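The bookkeeping that relates the number of grammar rules, the number of lexical entries and the coverage of the test corpus can be sketched as follows. The `Snapshot` record and the gain-per-rule measure are illustrative assumptions, not part of the working method as documented, and the figures in the usage example are invented.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """State of the lingware at one point during development."""

    grammar_rules: int
    lexical_entries: int
    sentences_parsed: int   # test sentences accepted by the grammar
    sentences_total: int    # size of the test corpus (must be > 0)

    @property
    def coverage(self):
        """Fraction of the test corpus that is covered."""
        return self.sentences_parsed / self.sentences_total


def coverage_gain_per_rule(previous, current):
    """Coverage gained per added grammar rule, or None if no rules were added."""
    added = current.grammar_rules - previous.grammar_rules
    if added <= 0:
        return None
    return (current.coverage - previous.coverage) / added


if __name__ == "__main__":
    prototype = Snapshot(grammar_rules=50, lexical_entries=400,
                         sentences_parsed=60, sentences_total=100)
    later = Snapshot(grammar_rules=70, lexical_entries=550,
                     sentences_parsed=80, sentences_total=100)
    print(prototype.coverage, later.coverage)
    print(coverage_gain_per_rule(prototype, later))
```

Recording such snapshots at every increment shows whether the gain per rule is falling, which is one possible signal that further upscaling is becoming expensive.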
3
Controlled language projects realized by Cap
Gemini Lingware Services
Until now, Lingware Services has developed
lingware for the following purposes:
§
Correction and standardization
of Dutch help texts
§
Correction and standardization
of Dutch software user manuals
§
Correction and standardization
of Dutch software design texts
§
Correction and standardization
of maintenance manuals written in “Simplified English”
§
Dutch-English translation of
help texts
§
English-Dutch translation of
system texts
§
Dutch-Spanish translation of
help texts
§
Dutch-German and Dutch-English
translation of software design texts
§
Simplified English-French
translation of maintenance manuals.
4
Open source lingware
The lingware for some projects realized by
Cap Gemini Lingware Services is now open source, as is the software of
Clarity.
See the Readme on lingware.
5
Research questions
Several research questions present
themselves.
§
What is the influence of the
grammar formalism on usability?
§
How to improve the working
methodology for the creation of lingware?
§
What is the hidden complexity
of the upscaling of lingware?
§
There is a huge need for
sharable CL corpora for different domains. Very few examples (other than AECMA
SE) of writing in CLs are publicly available.
§
The human factors of the
authors’ environment.
§
How can we make lingware as consistent
as possible for authors?
§
How can we improve the test
management for lingware?
§
In general: a discipline for
developing controlled languages shall be established. The scientific rationale is
that the problem space of MT is too large so that sub-problems have to be
specified for which effective solutions can be developed. For Language
Engineering the solutions shall be useful in practical circumstances, reliable
and cost-effective.
6
Literature
o
Documents within LingDoc
§
Position Papers
§
Manuals
o
General purpose
§
Lexicons
§
Thesauri
-
Ref
D:\Know_Language_Technology\Syntens_thesaurus.doc
o
Papers Lingware Services
§
Van der Eijk, Pim; and
Jacqueline van Wees: Supporting controlled language authoring. In: EAMT Workshop, Geneva, 2-3
April 1998.
§
Van der Eijk, Pim; De Koning,
Michiel; and Van der Steen, Gert.
Controlled Language Correction and Translation. Proceedings of the First
International Workshop on Controlled Language Applications (CLAW96). Leuven, Belgium:
Katholieke Universiteit Leuven Centre for
Computational Linguistics, March 26-27, 1996, pp. 64-73.
§
Van der Steen,
G.J.; and Dijenborgh, A.J., "Online correction and translation of
Industrial Texts". In "Translation and the Computer 14: Quality
Standards and the Implementation of Technology in Translation", ASLIB -
The Association for Information Management, London, 1992, pp.135-164.
o
Information about deployment of
controlled languages and MT systems can be found in e.g. “Implementing Machine
Translation, LISA Best Practice Guide”, 2004, http://lisa.org/Best-Practice-Guides
o
Introductions to controlled
language can be found in e.g.
o
Information about CLAW can be
found in:
§
4th Controlled Language
Applications Workshop (CLAW) and Seventh International Workshop of the European
Association for Machine Translation (EAMT) combined as EAMT/CLAW2003 http://www.eamt.org/eamt-claw03/programme.html
§
2nd International Controlled
Language Applications Workshop (CLAW98) http://www.lti.cs.cmu.edu/CLAW98/