LingDoc and Applications of Language and Document Engineering: Position Paper


This position paper discusses how LingDoc Transform and LingDoc Clarity suit Language and Document Engineering. First, we give an overview of applications in the two fields and their current solutions, organized by topics from Linguistics, Document Management and Computer Science. Second, we discuss the LingDoc approach in more detail.

1         Description of Language Engineering

Language Engineering is the most practical in a series of disciplines concerned with natural language and computing:

§       Computational Linguistics

Computational linguistics is an interdisciplinary field dealing with the rule-based and/or statistical modeling of natural language from a computational perspective. This modeling is not limited to any particular field of linguistics. In general, computational linguistics draws upon the involvement of linguists, computer scientists, experts in artificial intelligence, cognitive psychologists, mathematicians, and logicians, among others.

§       NLP

Natural language processing (NLP) is a subfield of artificial intelligence and computational linguistics. It studies the problems of automated understanding and generation of natural human languages.
Natural-language-understanding systems convert samples of human language into more formal representations that are easier for computer programs to manipulate. Natural-language-generation systems convert information into normal-sounding human language. Statistical NLP uses stochastic, probabilistic and statistical methods to resolve some of the difficulties arising from rule-based modeling, especially those which arise from highly ambiguous realistic grammars. Methods for disambiguation often involve the use of corpora and Markov models.

§       Language Technology

Language technology adds many application-oriented aspects to NLP.

§       Language Engineering

Language engineering is the most practical of these disciplines. It concerns the creation of natural language processing systems whose costs and outputs are measurable and predictable.
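The statistical disambiguation with Markov models mentioned under NLP above can be illustrated with a minimal sketch. It tags the ambiguous word "flies" with a first-order hidden Markov model and the Viterbi algorithm. The tag set, all probabilities and the function name `viterbi` are invented for this illustration; in practice the probabilities would be estimated from a corpus.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for `words` under a first-order HMM."""
    # best[i][t] = (probability of the best path ending in tag t at word i, previous tag)
    best = [{t: (start_p[t] * emit_p[t].get(words[0], 0.0), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            column[t] = max(
                (best[-1][p][0] * trans_p[p][t] * emit_p[t].get(word, 0.0), p)
                for p in tags
            )
        best.append(column)
    # Trace back from the most probable final tag.
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    return list(reversed(path))


# Toy model: every number below is an assumption made for the example.
TAGS = ["DET", "NOUN", "VERB"]
START_P = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS_P = {
    "DET": {"DET": 0.0, "NOUN": 0.9, "VERB": 0.1},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
EMIT_P = {
    "DET": {"the": 1.0},
    "NOUN": {"time": 0.5, "flies": 0.5},
    "VERB": {"time": 0.1, "flies": 0.9},
}

print(viterbi(["the", "flies"], TAGS, START_P, TRANS_P, EMIT_P))   # ['DET', 'NOUN']
print(viterbi(["time", "flies"], TAGS, START_P, TRANS_P, EMIT_P))  # ['NOUN', 'VERB']
```

The same word receives a different tag depending on its left context, which is the kind of decision a purely rule-based grammar would leave ambiguous.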

In well-known textbooks, such as Jurafsky and Martin (2000, 2008) and Mitkov (2005), the applications of NLP are broadly classified as:

§       Information Extraction

§       Question Answering and Summarization

§       Dialogue and Conversational Agents

§       Machine Translation.

The applications make use of knowledge in a number of linguistic layers:

§       Phonetics and Phonology: oriented towards sound

§       Morphology: about structure of words

§       Lexicography: about words and idiom

§       Syntax: about structure of sentences

§       Semantics: about meaning, applying knowledge according to a variety of languages and theories

§       Pragmatics and dialogue: about contexts, intentions, interactions and environments

§       Discourse: about the background of shared meaning and cultural knowledge.

In Language Engineering, each linguistic layer involved requires at least the following activities:

§       The writing of specifications

§       The definition of the functionality of tools and references to be used

§       The choice of development tools.

o      Requirements

§       A typical aspect of practical Language Engineering is integrated error handling among all linguistic layers. Errors arise from ill-formed input as well as from wrong or inconsistent specifications.

§       All forms of ambiguity across all linguistic layers also have to be handled. Ambiguities may be caused by:

-    temporal ambiguity: according to the specifications there are two or more possible ways to proceed in order to reach a final structural description;

-    inherent ambiguity: according to the specifications there are several possible structural descriptions:

-      due to insufficient knowledge and (wide) context
-      and/or lack of sufficient coverage by grammars, lexicons and thesauri.

-    built-in strategies for error-recovery which try to recover along different paths.

§       A typical aspect of Computational Linguistics may be the requirement that a competence grammar is congruent with the computational model. In that case a partial semantic representation has to be built during on-line processing. All linguistic layers have to play in concert.

§       Machine Translation (MT) can be seen as the hardest problem of Linguistics; it is an ultimate test for linguistic theory. Linguistic theory is still far from complete, so practical solutions in Language Engineering have to restrict the natural-language expressions they accept. That is the topic of the position paper on “LingDoc and Lingware Management for MT and Controlled Languages”.

§       Tools shall be optimized for usability.

o      Current solutions

For each type of application, the textbooks discuss the approaches that have been taken so far.

There are commercial packages for Language Engineering, such as Natlanco, PATR II and Textkernel. Free packages are provided by universities. LISA publishes overviews of tools for the process of (Human-Aided) Machine Translation.

o      Problems with current solutions

The textbooks do not discuss:

§       Integrated handling of ambiguity through all linguistic layers

§       Integrated error handling

§       Relevant topics from Computer Science, like

-    program generation for formal automata, in order to optimize speed

-    the use of parse forests by formal automata for the efficient handling of ambiguities

-    complexity issues.
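As a rough illustration of why shared structures matter for ambiguity, the sketch below counts the analyses of a sentence without enumerating them. Note the substitution: it uses a CKY chart over a grammar in Chomsky normal form, not the automaton-based parse forests referred to above. The toy grammar, the lexicon and the function name `count_parses` are assumptions made for this example. Each chart cell stores one entry per category and span, however many derivations lead to it, which is the sharing idea behind a packed parse forest.

```python
from collections import defaultdict


def count_parses(words, lexicon, rules, start="S"):
    """Count the parse trees of `words` with a CKY chart (grammar in Chomsky normal form)."""
    n = len(words)
    # chart[(i, j)][A] = number of distinct trees with root A spanning words[i:j]
    chart = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(words):
        for category in lexicon[word]:
            chart[(i, i + 1)][category] += 1
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):  # split point between the two daughters
                for parent, left, right in rules:
                    n_left = chart[(i, k)].get(left, 0)
                    n_right = chart[(k, j)].get(right, 0)
                    if n_left and n_right:
                        chart[(i, j)][parent] += n_left * n_right
    return chart[(0, n)].get(start, 0)


# Toy grammar with the classic PP-attachment ambiguity (illustrative only).
RULES = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),
    ("NP", "NP", "PP"),
    ("NP", "Det", "N"),
    ("PP", "P", "NP"),
]
LEXICON = {
    "I": ["NP"], "saw": ["V"], "the": ["Det"],
    "man": ["N"], "telescope": ["N"], "with": ["P"],
}

print(count_parses("I saw the man".split(), LEXICON, RULES))                       # 1
print(count_parses("I saw the man with the telescope".split(), LEXICON, RULES))   # 2
```

The chart has a number of cells that is quadratic in the sentence length, and filling it takes cubic time, while the number of trees it represents can grow exponentially. This is the complexity argument for packed representations.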

o      Position of LingDoc in general

LingDoc is aimed at Language Engineering, solving complexity problems by a combination of useful formalisms and advanced computer science. It is intended for language understanding and not for language generation. It does not use a statistical approach.

o      Position of LingDoc Transform

§       Transform targets many types of applications, especially as an engine for analysis and transduction.

§       The specifications in Transform are written as formal grammars and lexicons.

§       Transform can be used at the levels of Phonology, Morphology and Syntax.

§       Cascaded grammars may be used, for example, to integrate linguistic layers and to modularize large (transduction) grammars.

§       Transform has the on-line property: it can process its input as a stream.
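The ideas of cascading and on-line processing can be sketched with Python generators. This is not Transform's specification language or API. The two layers, the lexicon, the plural rule and all function names are hypothetical and serve only to show how one layer's output stream can feed the next layer without buffering the whole input.

```python
def tokenize(chars):
    """Layer 1: group a character stream into word tokens."""
    word = []
    for ch in chars:
        if ch.isalpha():
            word.append(ch)
        elif word:
            yield "".join(word)
            word = []
    if word:
        yield "".join(word)


def analyse(tokens, lexicon):
    """Layer 2: map each token to a (stem, category) pair using a toy lexicon."""
    for token in tokens:
        low = token.lower()
        if low in lexicon:
            yield (low, lexicon[low])
        elif low.endswith("s") and lexicon.get(low[:-1]) == "N":
            yield (low[:-1], "N+PL")  # toy plural rule, nouns only
        else:
            yield (low, "UNKNOWN")


LEXICON = {"the": "DET", "parser": "N", "grammar": "N", "read": "V"}


def cascade(chars):
    """Compose the two layers; nothing is computed until output is requested."""
    return analyse(tokenize(chars), LEXICON)


print(list(cascade("The parsers read grammars.")))
# [('the', 'DET'), ('parser', 'N+PL'), ('read', 'V'), ('grammar', 'N+PL')]

# On-line behaviour: the first analysis is available after reading only "The ".
stream = iter("The parsers read grammars.")
print(next(cascade(stream)))  # ('the', 'DET')
print("".join(stream))        # parsers read grammars.
```

Because each layer is a separate function, a large analysis can be split into small modules that are tested on their own and then chained.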

o      Position of LingDoc Clarity

§       Clarity targets the use of controlled languages and the machine translation of (very) controlled informative texts.

§       Specifications in Clarity are written as formal grammars, lexicons and thesauri.

§       Clarity can be used at the level of Syntax.

§       See also the separate position paper “LingDoc and Lingware Management for Controlled Languages and MT”.

2         Description of Document Engineering

Document engineering deals with the specification, design and implementation of documents and of the processes that create and handle them. Documents range from book pages to web pages and may have several presentations. In general, a document may be viewed as a stream of bytes, which is why techniques from Language Engineering can be applied to it.

Typical activities in Document Engineering are structuring, conversion, information extraction and comparison of documents.

o      Requirements

§       Structuring

Structuring concerns the handling of formal descriptions of the structure and presentation of documents, such as document schemas and style sheets in XML.

§       Conversion

Document conversion may be compared with Machine Translation. As such, it is the hardest problem of Document Engineering. In general, XML up-conversion can be seen as an ultimate test of the handling of context-sensitivity, errors and inconsistencies.

Continuous conversion processes require on-line (streaming) processing.

§       Information extraction

Information extraction from documents is facilitated by the formal structure of documents. Specifications for the extraction may be written as patterns or as queries for databases which hold the documents.

§       Comparison

(Structured) documents may be compared in order to locate their differences.
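The two styles of extraction specification mentioned above, patterns and queries, can be sketched with Python's standard library. The sample document, its element names and the e-mail pattern are invented for this illustration. The queries use the limited XPath subset supported by `xml.etree.ElementTree`, not a document database.

```python
import re
import xml.etree.ElementTree as ET

# Invented sample document.
DOC = """<report>
  <section id="s1"><title>Parsing</title><p>Contact: alice@example.org</p></section>
  <section id="s2"><title>Conversion</title><p>See section s1. Contact: bob@example.org</p></section>
</report>"""

root = ET.fromstring(DOC)

# Query style: the formal structure of the document selects the content.
titles = [t.text for t in root.findall("./section/title")]
s2_paragraphs = [p.text for p in root.findall("./section[@id='s2']/p")]

# Pattern style: a regular expression over the text, ignoring the markup.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
emails = EMAIL.findall("".join(root.itertext()))

print(titles)         # ['Parsing', 'Conversion']
print(s2_paragraphs)  # ['See section s1. Contact: bob@example.org']
print(emails)         # ['alice@example.org', 'bob@example.org']
```

The query relies on the document being well structured, whereas the pattern also works on unstructured text but has no notion of where in the document a match occurs.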

o      Current solutions and their problems

§       Since the advent of SGML in 1986 and its successor XML in 1998, a host of solutions for Document Engineering has been produced.

§       Solutions for up-conversion to XML, and their problems, are discussed in the position paper “LingDoc and XML Up-Conversion”.

o      Position of LingDoc Transform

§       Transform can be used for the transformation of documents: e.g. for the up-conversion to XML. In that case, the source document is in a non-XML format; the target document is in XML.

§       In essence, a document is treated linguistically, like a sentence, and conversion is treated like Machine Translation. Transform can be used for both processes.

§       The structure of the source document is specified by a formal grammar, akin to the document schema of the target document.

§       This is further discussed in the separate position paper: “LingDoc and XML Up-Conversion”.
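A minimal sketch of up-conversion, under stated assumptions: the source is a plain-text format described by the toy grammar in the docstring, and the target uses the hypothetical elements `doc`, `title` and `p`. This is not Transform's grammar notation. The converter is written as a generator so that it emits target fragments while it reads the source, line by line.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def up_convert(lines):
    """Yield XML fragments for a stream of plain-text lines.

    Toy source grammar (an assumption for this sketch):
        document  ::= block*
        block     ::= heading | paragraph
        heading   ::= "= " text NEWLINE
        paragraph ::= text-line+   (closed by a blank line, a heading or end of input)
    """
    yield "<doc>"
    paragraph = []
    for line in lines:
        line = line.rstrip("\n")
        is_heading = line.startswith("= ")
        if is_heading or not line.strip():
            if paragraph:  # a heading or blank line closes the open paragraph
                yield "<p>" + escape(" ".join(paragraph)) + "</p>"
                paragraph = []
            if is_heading:
                yield "<title>" + escape(line[2:].strip()) + "</title>"
        else:
            paragraph.append(line.strip())
    if paragraph:
        yield "<p>" + escape(" ".join(paragraph)) + "</p>"
    yield "</doc>"


SOURCE = [
    "= Costs & outputs\n",
    "Language engineering builds\n",
    "measurable systems.\n",
    "\n",
    "x < y\n",
]
xml_text = "".join(up_convert(SOURCE))
print(xml_text)
# <doc><title>Costs &amp; outputs</title><p>Language engineering builds measurable systems.</p><p>x &lt; y</p></doc>

ET.fromstring(xml_text)  # raises ParseError if the output were not well-formed
```

The parallel with translation is visible even at this scale: the source grammar decides what each line is, and the target structure is generated from that analysis rather than from the surface text alone.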

o      Position of LingDoc RevXml

§       RevXml tracks all changes between two XML-, XHTML- or SGML-tagged documents. The tool generates revision markers (indicating addition, deletion and change) in the revised document; the markers themselves take the form of additional XML tags. RevXml takes into account the tree structure of the documents, the attributes and the text between tags. The basic unit of text may be a character, a word, a sentence or the complete text between two tags. No document schema is required. (See further the “Management Summary LingDoc RevXml”.)
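The idea of revision markers expressed as additional tags can be sketched as follows. This is not RevXml's algorithm. The sketch assumes that both documents have the same element structure, it ignores attributes and text that follows a child element, it uses the word as the basic unit, and the marker names `ins` and `del` are hypothetical. The comparison itself is delegated to `difflib.SequenceMatcher` from the Python standard library.

```python
import difflib
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def mark_words(old, new):
    """Return `new` as an XML fragment with word-level <ins>/<del> markers."""
    a, b = old.split(), new.split()
    out = []
    for op, i1, i2, j1, j2 in difflib.SequenceMatcher(None, a, b).get_opcodes():
        if op in ("delete", "replace"):
            out.append("<del>" + escape(" ".join(a[i1:i2])) + "</del>")
        if op in ("insert", "replace"):
            out.append("<ins>" + escape(" ".join(b[j1:j2])) + "</ins>")
        if op == "equal":
            out.append(escape(" ".join(b[j1:j2])))
    return " ".join(out)


def mark_tree(old, new):
    """Serialize `new`, marking text changes against `old` (same tree shape assumed)."""
    body = mark_words(old.text or "", new.text or "")
    body += "".join(mark_tree(o, n) for o, n in zip(old, new))
    return "<%s>%s</%s>" % (new.tag, body, new.tag)


OLD = "<doc><p>The tool tracks changes</p><p>a fast parser</p></doc>"
NEW = "<doc><p>The tool tracks all changes</p><p>a quick parser</p></doc>"

print(mark_tree(ET.fromstring(OLD), ET.fromstring(NEW)))
# <doc><p>The tool tracks <ins>all</ins> changes</p><p>a <del>fast</del> <ins>quick</ins> parser</p></doc>
```

A change is represented here as a deletion followed by an insertion. Handling structural differences between the two trees would require a tree alignment step before the text comparison.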

3         Literature

§       LingDoc Documents

-    Position Papers

-    Manuals

§       Jurafsky, Daniel, and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, 2000, 2008.

§       Mitkov, Ruslan (ed.). The Oxford Handbook of Computational Linguistics. Oxford Handbooks in Linguistics. Oxford University Press, 2005.