3. LingDoc and Lingware Management for Controlled Languages and MT: Position Paper

 

In this position paper we describe the use of LingDoc for the handling of controlled languages and for their subsequent translation.

1         Problems with Language Engineering

In the Position Paper “LingDoc and Applications of Language and Document Engineering” the related disciplines of Language Engineering were defined. Each of these disciplines is in a stage of development.

For Linguistics, there is a constant movement in linguistic theories, and change in formalisms. A formal theory of translation is needed.

Computational Linguistics makes linguistic theory computational, but if there is no sufficient theory the solutions tend to become ad-hoc. There is a necessity to formalize the meaning of words, sentences and discourse. Accurate translation requires an understanding of the context, as well as an understanding of the structure and rules of the language.

In Natural Language Processing (NLP) there is a problem with the complexity of unrestricted language. Vocabularies are vast, with many ambiguities. There is no natural language for which a complete grammar exists.

The worst-case scenario of NLP is found in systems for Machine Translation (MT). Most practical MT systems require human pre- and/or post-editing, which considerably reduces their cost-effectiveness. Moreover, input texts are frequently ill-formed. Therefore, unrestricted MT is not considered possible with a grammar-based approach.

The alternatives are either to follow a statistical approach with machine-learned rules or to use a restricted language. The first approach is still in a stage of development. We concentrate on the second one.

2         The rationale of Controlled Language

For Language Engineering the use of restricted, or controlled, language is a viable approach. The terms “restricted” and “controlled” are used more or less interchangeably.

An example is the correction of the following sentence:

[Figure not reproduced: an example sentence and its correction.]

In general, the restriction may concern readability and/or translatability. This corresponds with the distinction between monolingual and multilingual applications.

The restriction may concern only the lexicon. An example is “AECMA Simplified English”, which the aviation industry has decided to introduce as the standard for its maintenance manuals. On grammar and style, authors are given advice only. On the other hand, the restriction may also concern the formal grammar. For instance, one of the first fully automatic MT systems that worked correctly was Meteo, a system for weather forecasts in Canada. Its language was heavily restricted, which the restricted domain made possible.
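As an illustration of a restriction that concerns only the lexicon, the following Python sketch checks the words of a sentence against an approved word list. The word list and the function name `unapproved_words` are hypothetical; they are not taken from AECMA Simplified English or from any LingDoc tool.

```python
import re

# Hypothetical approved vocabulary; a real controlled lexicon is far larger.
APPROVED_WORDS = {"remove", "the", "cover", "from", "pump", "and", "clean", "it"}


def unapproved_words(sentence, approved=APPROVED_WORDS):
    """Return the words of `sentence` that are not in the approved lexicon."""
    words = re.findall(r"[A-Za-z]+", sentence.lower())
    return [word for word in words if word not in approved]


print(unapproved_words("Remove the cover from the pump."))  # []
print(unapproved_words("Detach the cover from the pump."))  # ['detach']
```

A check of this kind says nothing about grammar; it only flags vocabulary outside the agreed list.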

In short, the advantages of controlled languages, compared with unrestricted natural language, are:

§       The vocabulary becomes smaller.

§       The size of the grammar is reduced.

§       Ambiguities are eliminated.

§       Interpretation is unequivocal: texts become more readable, which makes them accessible to a larger audience.

§       Authors make fewer errors; texts are more precise.

§       Efficiency is increased when texts are processed.

§       In the case of MT: custom systems may be built which can reduce human post-editing or even produce correct translations.

3         Requirements for Controlled Language

The following requirements stem from our own experience and from conferences on controlled languages.

o                                            Working methods

§       Systems for controlled languages shall be productive and effective. Complexity shall be conquered by subdividing problems (as a general principle). For different user needs different customized products shall be developed.

§       Systems shall be reliable, especially when they report errors and inconsistencies.

§       The grammars and lexicons (the “lingware”) shall be adjustable.

§       Upscaling of existing systems for more general applicability shall be done gradually and carefully, monitoring correctness, complexity and cost-effectiveness in small steps.

o                                            Lingware

§       In the grammar formalism there shall be facilities for rules expressing

-                                                              Readability

-                                                              Translatability

-                                                              Standardization for non-conforming sentences

-                                                              Error reporting and correction

-                                                              Context-sensitivity for XML markup.

§       The understandability of the lingware shall be maximized. Contributing factors are:

-                                                              The modularity of the grammar

-                                                              The number of grammar rules

-                                                              The interaction between rules

-                                                              The shallowness of parses

-                                                              The discipline of linguists to keep the grammar unambiguous. Note: Linguists coming straight from universities tend to write wide-coverage grammars.

-                                                              The consistency for authors.

§       The reliability of the lingware may depend on the following factors:

-                                                              In the course of time there will be intra-personal discrepancies.

-                                                              If there is collaborative work on one grammar there is a danger of inter-personal discrepancies.

o                                            Training courses for authors

§       Many authors do not like to use a controlled language, because they feel (or fear) that it restrains them. Authors shall be assisted as much as possible with technical support.

§       Re-train technical writers on basic grammar before teaching them complex CL rules (citation from Jeff Allen, 2004a)

§       Users shall be trained progressively on CL grammar rules. Do not impose many (especially complex) CL rules to be learned and mastered immediately and perfectly (Allen)

§       Terminology and teaching methodology from (computational) linguistics shall be minimized (Allen)

§       Do not conduct pilots with users having limited computer skills and/or expectations of computers to locate and correct their errors. (Allen)

§       Avoid cross-cultural problems

o                                            Workbenches for authors

§       Correct and incorrect sentences shall be marked, e.g. by the colors green and red

§       Reliable error messages shall be presented

§       Correct alternatives shall be presented

§       Authors shall be allowed to postpone a correction

§       The workbench shall provide facilities for workflow between authors and their managers

§       There shall be a logging facility for decisions taken by authors.
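The workbench requirements listed above (marking, error messages, alternatives, postponement and logging) can be pictured with a small Python sketch. It is an illustration under stated assumptions, not the design of any existing workbench: the names `CheckResult`, `DecisionLog`, `check` and the table `PREFERRED_TERMS` are hypothetical, and the single toy rule only replaces non-preferred terms.

```python
import re
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical table of non-preferred terms and their preferred alternatives.
PREFERRED_TERMS = {"utilize": "use", "commence": "start"}


@dataclass
class CheckResult:
    sentence: str
    correct: bool
    message: str = ""
    alternatives: List[str] = field(default_factory=list)

    @property
    def color(self) -> str:
        # Correct sentences are marked green, incorrect ones red.
        return "green" if self.correct else "red"


@dataclass
class DecisionLog:
    entries: List[Tuple[str, str, str]] = field(default_factory=list)

    def record(self, author: str, sentence: str, decision: str) -> None:
        # A decision is, for example, "accepted", "rejected" or "postponed".
        self.entries.append((author, sentence, decision))


def check(sentence: str) -> CheckResult:
    """Flag the first non-preferred term and propose a corrected sentence."""
    for term, preferred in PREFERRED_TERMS.items():
        pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
        if pattern.search(sentence):
            return CheckResult(
                sentence,
                False,
                message=f'"{term}" is not a preferred term',
                alternatives=[pattern.sub(preferred, sentence)],
            )
    return CheckResult(sentence, True)


result = check("Always utilize the red lever.")
log = DecisionLog()
log.record("author1", result.sentence, "postponed")  # the author defers the fix
print(result.color, result.message, result.alternatives)
```

Keeping the decision log separate from the checker means the same log can later serve the workflow between authors and their managers.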

o                                            Workbenches for grammar writing

§       The workbench shall provide for collaboration in workgroups of linguists

§       There have to be facilities for tracing and debugging

§       There have to be facilities for test management.

o                                            Workbenches for lexicons

§       It shall be possible to reuse existing Machine Readable Dictionaries

§       It shall be possible to access a thesaurus with preferred terms

§       If a token is not in the lexicon the author shall be allowed to update the lexicon; the update has to be approved by the lexicon manager

§       There shall be assistance with the assignment of lexical features.
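The requirement that an author may propose a lexicon update which the lexicon manager must approve can be sketched as a two-stage store. This is a minimal Python illustration; the class `Lexicon` and its methods `propose` and `approve` are hypothetical names, and the feature dictionary is only an example.

```python
from dataclasses import dataclass, field


@dataclass
class Lexicon:
    entries: dict = field(default_factory=dict)   # approved tokens
    pending: dict = field(default_factory=dict)   # proposals awaiting approval

    def propose(self, token, features, author):
        """An author proposes a new token; it is not usable until approved."""
        if token in self.entries:
            return False
        self.pending[token] = (features, author)
        return True

    def approve(self, token):
        """The lexicon manager moves a pending proposal into the lexicon."""
        features, _author = self.pending.pop(token)
        self.entries[token] = features


lexicon = Lexicon()
lexicon.propose("gasket", {"cat": "noun", "num": "sg"}, author="author1")
print("gasket" in lexicon.entries)  # False
lexicon.approve("gasket")
print("gasket" in lexicon.entries)  # True
```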

4         Current solutions for Controlled Language

Controlled languages are discussed at, for instance, the Conferences of CLAW and LISA.

In “Can Controlled Languages Scale to the Web?”, presented at CLAW 2006, Jonathan Pool concludes: “Controlled natural languages that have been reported as successes have been mainly restrictive: designed for limited, intra-organization or intra-industry purposes. That they cover single domains and genres, with repetitive and trainable authors, facilitates their efficacy.” In that paper he discusses a number of controlled languages.

In “Introduction to Controlled Languages” (2004), Jeff Allen presents the following list of controlled languages:

§       Monolingual CLs

-                                                              Basic English (1930s)

-                                                              Caterpillar Fundamental English (CFE) 1970s

-                                                              International Language of Service and Maintenance (ILSAM)

-                                                              Bull Global English

-                                                              Perkins/Univ Edinburgh PACE

-                                                              AECMA Simplified English (SE)

-                                                              GIFAS Rationalized French

-                                                              Kodak International Service Language Smart Controlled English

-                                                              General Motors Global English

-                                                              Securities and Exchange Commission (SEC) Plain English

-                                                              Fight the Fog (European Commission)

-                                                              MultiDoc project Controlled Languages

-                                                              Remedios Ruiz/Richard Sutcliffe Controlled Spanish

§       Multilingual CLs

-                                                              Caterpillar Technical English (CTE)

-                                                              Attempto Controlled English (ACE)

-                                                              Alcatel COGRAM

-                                                              Xerox Multilingual Customized English

-                                                              Kodak International Service Language

-                                                              General Motors Controlled Automotive Service Language (CASL)

-                                                              IBM Easy English

-                                                              ProLingua LinguaNet

-                                                              Diebold Controlled English

-                                                              Scania Swedish

-                                                              Smart Controlled English

-                                                              Nortel Standard English (NSE)

-                                                              OCÉ Controlled English

-                                                              Sun Controlled English

-                                                              Avaya Controlled English

-                                                              Oracle ORACAL

-                                                              Allen's Controlled English for DIPLOMAT project

At conferences, few technical details are reported about the formalisms of grammars and lexicons, about workbenches and about training courses.

5         Position of LingDoc Transform

Transform has no special provisions for Controlled Languages.

In general, Transform has a mechanism for rewriting with transduction grammars.

Transform has been used by the software house BSO (now part of ATOS Origin) for the analysis stage of a large MT system (DLT), building dependency trees which were subsequently transformed.

6         Position of LingDoc Clarity: tools Capri, Lexbench and Author

o                                            Lingware

The grammar formalism of LingDoc Clarity Capri is a unification of a number of useful formalisms (see Position Paper 4). The formalism is rich enough for rules expressing readability, translatability, standardization and error reporting and correction.

Grammar rules may perform syntax directed translation. The value of lexical features can be manipulated, e.g. for the correction of agreement.
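The idea of manipulating lexical features to correct agreement can be illustrated with a small Python sketch. It does not use the Capri formalism; the lexicon, the feature names (`cat`, `lemma`, `num`) and the functions `form_for` and `correct_agreement` are assumptions made for this example, which covers only determiner-noun number agreement.

```python
# Hypothetical lexicon: each word form carries a few lexical features.
LEXICON = {
    "this":   {"cat": "det",  "lemma": "this",  "num": "sg"},
    "these":  {"cat": "det",  "lemma": "this",  "num": "pl"},
    "valve":  {"cat": "noun", "lemma": "valve", "num": "sg"},
    "valves": {"cat": "noun", "lemma": "valve", "num": "pl"},
}


def form_for(lemma, cat, num):
    """Look up the word form that realizes the given feature values."""
    for form, feats in LEXICON.items():
        if (feats["lemma"], feats["cat"], feats["num"]) == (lemma, cat, num):
            return form
    return None


def correct_agreement(tokens):
    """Make each determiner agree in number with the noun that follows it."""
    corrected = list(tokens)
    for i in range(len(corrected) - 1):
        det = LEXICON.get(corrected[i])
        noun = LEXICON.get(corrected[i + 1])
        if not det or not noun:
            continue
        if det["cat"] == "det" and noun["cat"] == "noun" and det["num"] != noun["num"]:
            fixed = form_for(det["lemma"], "det", noun["num"])
            if fixed is not None:
                corrected[i] = fixed
    return corrected


print(correct_agreement(["close", "this", "valves"]))  # ['close', 'these', 'valves']
```

The correction is expressed as a change of the `num` feature followed by a lookup of the matching form, rather than as a string substitution.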

Precise results are obtained by writing shallow grammars with many rules.

MT is treated as a correction process.

The lexical scanner handles the separation between lexical tokens, XML mark-up and other separators.
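A scanner that separates lexical tokens, XML mark-up and other separators can be sketched with one regular expression. The following Python fragment is an illustration only and makes no claim about how the Capri scanner is implemented; the group names `markup`, `word` and `separator` and the function `scan` are assumptions.

```python
import re

# One alternative per token class; the order matters because the first
# alternative that matches at a position wins.
TOKEN_PATTERN = re.compile(
    r"(?P<markup><[^>]+>)"           # XML tags such as <step> or </step>
    r"|(?P<word>\w+)"                # lexical tokens
    r"|(?P<separator>\s+|[^\w\s])"   # whitespace and single punctuation marks
)


def scan(text):
    """Return (token class, text) pairs for the whole input."""
    return [(m.lastgroup, m.group()) for m in TOKEN_PATTERN.finditer(text)]


for kind, value in scan("<step>Open the valve.</step>"):
    print(kind, repr(value))
```

Keeping the mark-up as tokens of their own class lets later grammar rules refer to it, for instance for context-sensitivity on XML elements.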

o                                            Capri

Capri stands for “Cap Gemini rules interpreter”. It is described in more detail in the position paper on Grammar Formalisms.

Capri acts as the engine for Author; it can also run independently as an analyzer or a translator.

o                                            Author

Author corrects and standardizes language according to the lingware for the Controlled Language.

Correct and incorrect sentences are marked (by the colors green and red).

Error messages are presented, together with alternatives.

There are facilities for workflow between authors and a manager who oversees the production process and who is an intermediary to the linguists who maintain the lingware.

o                                            Lexbench (Workbench for lexicons)

Lexbench supports multiple projects and multiple authors.

It is possible to use multiword and idiomatic expressions.

It is possible to access existing Machine Readable Dictionaries for getting proposals.

It is possible to access a thesaurus with preferred terms.

If a token is not in the lexicon the author may update the lexicon; the update is logged.

There is assistance with the assignment of lexical features.

o                                            Workbenches for grammar writing

So far the facilities are sparse. There is a provision to work with modules of sub-grammars, like include files. Some ideas for test management have been implemented; others are still pending.

o                                            Training courses for authors

Training courses for authors, linguists and Lexbench administrators have been developed.

7         Position of Clarity: Working method for the development of controlled language

Cap Gemini Lingware Services developed a working method for the development and adaptation of lingware in a modular and controllable fashion. This working method was continually refined and included a cost estimation for potential customers.

The working method is phased like any ICT project, with typical phases like feasibility study, definition study, information analysis, functional and technical design, implementation and deployment. Some activities are as follows.

o                                            Feasibility study

The language usage within a certain area of application in the customer organization (the “domain”) is investigated. Based on the results advice is given on the applicability of restricted language usage in that particular domain. Also, an estimation is made of the costs that the introduction of lingware would involve. This is not a trivial task. A checklist with rules of thumb for cost estimation has been continuously updated.

o                                            Customization of Grammars and Lexicons 

§       The lexicons and grammars are customized per application for the required source and target languages.

§       There is a parallel corpus with representative test-sentences.

§       Linguistic classifications are made of sentences and tokens; selections are made on the basis of possible coverage, ambiguity and relevance for the client.

§       Some types of sentences have to be rejected or transformed.

§       Grammar rules are developed gradually.

§       The relation between test sentences and grammar rules is maintained in order to facilitate later updates. 

§       Rules for correction, translatability and readability are added.

§       Many rules for readability can be derived and adapted from international writing and technical communication guidelines. Similarly, rules for the correction of writing errors can be derived from typologies of user mistakes.

§       If applicable, the document schema has to be taken into account, as a possible source for disambiguation.

§       The relation is maintained between the number of grammar rules, the number of lexical entries and the coverage of the test corpus. This information has to be gathered incrementally, so that insight can be gained into how well prototypes scale up to production versions.
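The bookkeeping described in this list, linking test sentences to grammar rules and tracking the coverage of the test corpus, can be sketched in Python as follows. The data layout (a mapping from each test sentence to the names of the rules used to analyze it) and the function names `coverage` and `sentences_by_rule` are assumptions for illustration; the rule names are invented.

```python
from collections import defaultdict


def coverage(results):
    """Fraction of test sentences that received an analysis.

    `results` maps each test sentence to the list of rule names used to
    analyze it; an empty list means the sentence is not covered.
    """
    if not results:
        return 0.0
    covered = sum(1 for rules in results.values() if rules)
    return covered / len(results)


def sentences_by_rule(results):
    """Invert the mapping: which test sentences depend on which rule."""
    index = defaultdict(list)
    for sentence, rules in results.items():
        for rule in rules:
            index[rule].append(sentence)
    return dict(index)


results = {
    "Open the valve.": ["imperative", "np"],
    "Close the valve.": ["imperative", "np"],
    "The valve is open.": ["declarative", "np"],
    "Valve the open.": [],
}
print(coverage(results))                          # 0.75
print(sentences_by_rule(results)["imperative"])   # ['Open the valve.', 'Close the valve.']
```

Recording these figures after each increment of the grammar gives the series of data points from which the scaling behaviour can be judged; the inverted index shows which test sentences to re-run when a rule is changed.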

o                                            The software may have to be integrated in other systems.

8         Controlled language projects realized by Cap Gemini Lingware Services

Until now, Lingware Services has developed lingware for the following purposes:

§       Correction and standardization of Dutch help texts

§       Correction and standardization of Dutch software user manuals

§       Correction and standardization of Dutch software design texts

§       Correction and standardization of maintenance manuals written in “Simplified English”

§       Dutch-English translation of help texts

§       English-Dutch translation of system texts

§       Dutch-Spanish translation of help texts

§       Dutch-German and Dutch-English translation of software design texts

§       Simplified English-French translation of maintenance manuals.

9         Open source lingware

The lingware for some projects realized by Cap Gemini Lingware Services is now open source, as is the software of Clarity.

See the Readme on lingware.

10        Research questions

Several research questions present themselves.

§       What is the influence of the grammar formalism on usability?

§       How to improve the working methodology for the creation of lingware?

§       What is the hidden complexity of the upscaling of lingware?

§       There is a huge need for sharable CL corpora for different domains. Very few examples (other than AECMA SE) of writing in CLs are publicly available.

§       What are the human factors of the authors' environment?

§       How can we make lingware as consistent as possible for authors?

§       How can we improve the test management for lingware?

§       In general: a discipline of developing controlled languages shall be developed. The scientific rationale is that the problem space of MT is too large so that sub-problems have to be specified for which effective solutions can be developed. For Language Engineering the solutions shall be useful in practical circumstances, reliable and cost-effective.

11        Literature

o                                            Documents within LingDoc

§       Position Papers

§       Manuals

o                                            General purpose

§       Lexicons

§       Thesauri

-                                                              Ref D:\Know_Language_Technology\Syntens_thesaurus.doc

o                                            Papers Lingware Services

§       Van der Eijk, Pim; and Jacqueline van Wees: Supporting controlled language authoring. In: EAMT Workshop, Geneva, 2-3 April 1998.

§       Van der Eijk, Pim; De Koning, Michiel; and Van der Steen, Gert. Controlled Language Correction and Translation. Proceedings of the First International Workshop on Controlled Language Applications (CLAW96). Leuven, Belgium: Katholieke Universiteit Leuven Centre for Computational Linguistics, March 26-27, 1996, pp. 64-73.

§       Van der Steen, G.J.; and Dijenborgh, A.J., "Online correction and translation of Industrial Texts". In "Translation and the Computer 14: Quality Standards and the Implementation of Technology in Translation", ASLIB - The Association for Information Management, London, 1992, pp.135-164.

o                                            Information about deployment of controlled languages and MT systems can be found in e.g. “Implementing Machine Translation, LISA Best Practice Guide”, 2004, http://lisa.org/Best-Practice-Guides

o                                            Introductions to controlled language can be found in e.g.

§       Jeff Allen and Kathy Barthe, 2004a, “Introduction to Controlled Languages”, http://www.geocities.com/controlledlanguage/Allen-Barthe-CL-intro-STC-France-v1.01-2Apr2004.ppt

§       Jeff Allen, 2004b, “Multilingual Machine-Oriented Controlled Languages”, http://www.geocities.com/controlledlanguage/Allen-multiling-CL-STC-Fr-v1.00-2apr2004.ppt

o                                            Information about CLAW can be found in:

§       5th International Workshop on Controlled Language Applications (CLAW 2006) http://www.geocities.com/controlledlanguage/CLAW2006.doc

§       4th Controlled Language Applications Workshop (CLAW) and Seventh International Workshop of the European Association for Machine Translation (EAMT) combined as EAMT/CLAW2003 http://www.eamt.org/eamt-claw03/programme.html

§       Link to several EAMT/CLAW2003 Conference presentations: http://www.ctts.dcu.ie/presentations.html

§       3rd International Controlled Language Applications Workshop (CLAW 2000) http://www.up.univ-mrs.fr/~veronis/claw2000/

§       2nd International Controlled Language Applications Workshop (CLAW98) http://www.lti.cs.cmu.edu/CLAW98/

§       1st International Controlled Language Applications Workshop (CLAW96) http://www.ccl.kuleuven.ac.be/CLAW/programme.html