LingDoc and Applications of Language and
Document Engineering: Position Paper
This position paper discusses how Transform and Clarity are suited to
Language and Document Engineering. First, we give an overview of applications
in the two fields and their current solutions, organized by topics from
Linguistics, Document Management and Computer Science. Second, we discuss the
LingDoc approach in more detail.
Language Engineering is the most practical
one in a series of disciplines concerning natural language and computing.
Computational linguistics is an interdisciplinary
field dealing with the rule-based and/or statistical
modeling of natural language from a computational
perspective. This modeling is not limited to any particular field of linguistics.
In general, computational linguistics draws upon the involvement of linguists, computer
scientists, experts in artificial intelligence, cognitive psychologists, mathematicians, and logicians, among others.
Natural language processing (NLP) is a
subfield of artificial intelligence and computational linguistics. It studies the
problems of automated understanding and generation of natural human languages.
Natural-language-understanding systems convert samples of human language into more
formal representations that are easier for computer
programs to manipulate. Natural-language-generation systems convert information
into normal-sounding human language. Statistical NLP uses stochastic
methods to resolve some of the difficulties arising from rule-based modeling,
especially those which arise from highly ambiguous realistic grammars. Methods
for disambiguation often involve the use of corpora and Markov models.
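As an illustration of disambiguation with a Markov model, the following minimal sketch tags a two-word sentence with a first-order hidden Markov model and the Viterbi algorithm. The tag set, the probabilities and the function name `viterbi` are invented for this example; in practice the probabilities would be estimated from a corpus.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for `words` under a first-order HMM."""
    if not words:
        return []
    # best[i][t] = (probability of the best path ending in tag t at word i, previous tag)
    best = [{t: (start_p[t] * emit_p[t].get(words[0], 0.0), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            column[t] = max(
                (best[-1][p][0] * trans_p[p][t] * emit_p[t].get(word, 0.0), p)
                for p in tags
            )
        best.append(column)
    # Trace the best path backwards from the most probable final tag.
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    return list(reversed(path))


# Toy model (all numbers are assumptions): "flies" may be a noun or a verb.
TAGS = ["NOUN", "VERB"]
START_P = {"NOUN": 0.8, "VERB": 0.2}
TRANS_P = {"NOUN": {"NOUN": 0.3, "VERB": 0.7}, "VERB": {"NOUN": 0.6, "VERB": 0.4}}
EMIT_P = {"NOUN": {"time": 0.6, "flies": 0.4}, "VERB": {"time": 0.1, "flies": 0.9}}

print(viterbi(["time", "flies"], TAGS, START_P, TRANS_P, EMIT_P))  # ['NOUN', 'VERB']
```

With these toy numbers the ambiguous word "flies" is resolved as a verb, because the path NOUN, VERB has the highest joint probability.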
Language technology adds many
application-oriented aspects to NLP.
Language engineering is the most
practical of these disciplines. It concerns the creation of natural language
processing systems whose costs and outputs are measurable and predictable.
In well-known textbooks, like Jurafsky and
Martin (2000, 2008) and Mitkov,
the applications of NLP are broadly classified as:
Question Answering and
Dialogue and Conversational
The applications make use of knowledge in
a number of linguistic layers:
Phonetics and Phonology: oriented towards sound
Morphology: about structure of
Lexicography: about words and
Syntax: about structure of
Semantics: about meaning,
applying knowledge according to a variety of languages and theories
Pragmatics and dialogue: about contexts, intentions, interactions and environments
Discourse: about the background of shared meaning and cultural knowledge.
In Language Engineering, for each
linguistic layer that is involved at least the following activities are at
The writing of specifications
The definition of the functionality of tools and references to be used
The choice of
A typical aspect of practical
Language Engineering is integrated error handling among all linguistic layers.
Errors arise from ill-formed input as well as from wrong or inconsistent
All forms of ambiguity across the linguistic layers also have to be
handled. The causes of ambiguity may be:
temporal ambiguity: according to the specifications there are two or more
possible ways to proceed in order to reach a final structural description;
according to the specifications, several structural descriptions are possible,
due to insufficient knowledge and (wide) context
and/or insufficient coverage by grammars, lexicons and thesauri;
built-in strategies for error recovery which try
to recover along different paths.
A typical aspect of
Computational Linguistics may be the requirement that a competence grammar is
congruent with the computational model. In that case a partial semantic representation
has to be built during on-line processing. All linguistic layers have to play
Machine Translation (MT) can be
seen as the hardest problem of Linguistics; it is an ultimate test for
linguistic theory. Nowadays linguistic theory is far from complete. Practical
solutions in Language Engineering have to place limitations on the expressions
in natural language. That is the topic of the position paper on “LingDoc and
Lingware Management for MT and Controlled Languages”.
Tools shall be optimized for
For each type of application the textbooks
discuss the approaches that have been taken so far.
There are commercial packages for Language
Engineering, like Natlanco, PATR II and Textkernel. Free packages are provided
by universities. LISA publishes overviews of tools for the process of (Human
Aided) Machine Translation.
The textbooks do not discuss:
Integrated handling of
ambiguity through all linguistic layers
Integrated error handling
Relevant topics from Computer Science:
program generation for formal automata, in order to optimize speed
the use of parse forests by formal automata for the efficient handling of ambiguities
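The idea of a parse forest can be sketched with a small chart parser. The example below uses the generic CYK algorithm over a toy grammar in Chomsky normal form; it is not the formal-automaton method referred to above, and the grammar, the lexicon and the function names are assumptions made for illustration. Each chart entry stores back-pointers instead of complete trees, so all structural descriptions share their common sub-analyses and can be counted without being enumerated.

```python
from collections import defaultdict
from functools import lru_cache

# Toy grammar in Chomsky normal form (an assumption for this sketch).
BINARY = [
    ("S", "NP", "VP"), ("VP", "V", "NP"), ("VP", "VP", "PP"),
    ("NP", "NP", "PP"), ("NP", "Det", "N"), ("PP", "P", "NP"),
]
LEXICON = {
    "I": {"NP"}, "saw": {"V"}, "the": {"Det"},
    "man": {"N"}, "with": {"P"}, "telescope": {"N"},
}


def build_forest(words):
    """Return (forest, cats): packed back-pointers and the categories per span."""
    n = len(words)
    cats = defaultdict(set)      # (i, j) -> categories that span words[i:j]
    forest = defaultdict(list)   # (i, j, A) -> list of (k, B, C) for A -> B C
    for i, word in enumerate(words):
        cats[(i, i + 1)] |= LEXICON.get(word, set())
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for a, b, c in BINARY:
                    if b in cats[(i, k)] and c in cats[(k, j)]:
                        cats[(i, j)].add(a)
                        forest[(i, j, a)].append((k, b, c))
    return forest, cats


def count_trees(forest, i, j, a):
    """Count the trees rooted in category `a` over words[i:j] without enumerating them."""
    @lru_cache(maxsize=None)
    def count(i, j, a):
        if j - i == 1:
            return 1  # lexical node
        return sum(count(i, k, b) * count(k, j, c)
                   for k, b, c in forest.get((i, j, a), ()))
    return count(i, j, a)


words = "I saw the man with the telescope".split()
forest, cats = build_forest(words)
print(count_trees(forest, 0, len(words), "S"))  # 2
```

The two readings (the prepositional phrase attached to the verb phrase or to the noun phrase) appear as two back-pointers on the single entry for `VP` over the span from "saw" to "telescope". Everything else in the chart is shared between them.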
LingDoc is aimed at Language Engineering,
solving complexity problems by a combination of useful formalisms and advanced
computer science. It is intended for language understanding and not for
language generation. It does not use a statistical approach.
Transform directs itself
towards many types of applications, especially as an engine for analysis and
The specifications in Transform
are written as formal grammars and lexicons.
Transform can be used at the
levels of Phonology, Morphology and Syntax.
Cascaded grammars may be used
e.g. for the integration of linguistic layers and for the modularization of
large (transduction) grammars.
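A cascade can be pictured as a chain of transducers in which the output of one layer is the input of the next. The following minimal sketch uses two Python generators: a lexical layer that turns text into tagged words, and a chunking layer that groups the pattern Det? Adj* N into noun phrases. The tag names, the small lexicon and the function names are assumptions for illustration; the sketch does not use Transform's grammar formalism.

```python
import re
from typing import Iterable, Iterator, List, Tuple

TOKEN = re.compile(r"[A-Za-z]+|[.,;]")
# Tiny lexicon, invented for this sketch.
TAGS = {"the": "Det", "a": "Det", "old": "Adj", "dog": "N", "cat": "N", "sees": "V"}


def lexical_layer(text: str) -> Iterator[Tuple[str, str]]:
    """First layer: transduce a character string into (word, tag) pairs."""
    for match in TOKEN.finditer(text):
        word = match.group()
        default = "Unk" if word.isalpha() else "Punct"
        yield word, TAGS.get(word.lower(), default)


def chunk_layer(tagged: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, List[str]]]:
    """Second layer: group Det? Adj* N into NP chunks, pass other words through."""
    pending: List[Tuple[str, str]] = []  # possible start of a noun phrase
    for word, tag in tagged:
        if tag == "Adj" or (tag == "Det" and not pending):
            pending.append((word, tag))
        elif tag == "N":
            yield "NP", [w for w, _ in pending] + [word]
            pending = []
        else:
            # The pending words did not end in a noun: emit them unchanged.
            for w, t in pending:
                yield t, [w]
            pending = []
            if tag == "Det":
                pending.append((word, tag))
            else:
                yield tag, [word]
    for w, t in pending:
        yield t, [w]


# The layers are composed by feeding one generator into the next.
for chunk in chunk_layer(lexical_layer("The old dog sees a cat.")):
    print(chunk)
```

Because both layers are generators, the second layer starts to produce chunks before the first layer has finished reading the input, and each layer can be specified and tested separately.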
Transform has the on-line
Clarity is aimed at
the use of controlled languages and at machine translation of (very)
controlled informative texts.
Specifications in Clarity are
written as formal grammars, lexicons and thesauri.
Clarity can be used at the
level of Syntax.
See also the separate position
paper “LingDoc and Lingware Management for Controlled Languages and MT”.
Document Engineering deals with the specification, design and implementation
of documents and the processes that create and handle them. Documents
can be anything between book pages and web pages, with a number of appearances.
In general a document may be viewed as a stream of bytes. This view accounts
for the applicability of techniques from Language Engineering.
Typical activities in Document Engineering
are structuring, conversion, information extraction and comparison of
The handling of formal
descriptions for structure and presentation of documents, like document schemas
and style sheets in XML.
Document conversion may be compared with
Machine Translation. As such it is the hardest problem of Document Engineering.
In general XML up-conversion can be seen as an ultimate test for handling of
context-sensitivity, errors and inconsistencies.
Continuous conversion processes require
on-line (streaming) processing.
Information extraction from documents is
facilitated by the formal structure of documents. Specifications for the
extraction may be written as patterns or as queries for databases which hold
(Structured) documents may be compared in
order to locate their differences.
With the advent of SGML in 1986
and its successor XML in 1998
a host of solutions for Document Engineering has been
Solutions for up-conversion to
XML, and its problems, are discussed in the position paper “LingDoc and XML
Up-Conversion”.
Transform can be used for the
transformation of documents: e.g. for the up-conversion to XML. In that case,
the source document is in a non-XML format; the target document is in XML.
In essence, a document is
treated linguistically, like a sentence, and conversion is treated like Machine
Translation. Transform can be used for both processes.
The structure of the source
document is specified by a formal grammar, akin to the document schema of the
This is further discussed in
the separate position paper: “LingDoc and XML Up-Conversion”.
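The principle of grammar-driven up-conversion can be sketched for a very small source format. The source format, the grammar and the element names (`doc`, `title`, `para`, `list`, `item`) are assumptions chosen for this example; the code is a hand-written recursive-descent recognizer using Python's standard library, not Transform itself. The assumed grammar is: a document is a title line followed by blocks, a block is a paragraph line or a list, and a list is one or more lines that start with "- ".

```python
import xml.etree.ElementTree as ET


def up_convert(text: str) -> ET.Element:
    """Parse the plain-text source according to the toy grammar and build XML."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty document: the grammar requires a title line")
    doc = ET.Element("doc")
    ET.SubElement(doc, "title").text = lines[0]   # doc -> title block*
    i = 1
    while i < len(lines):
        if lines[i].startswith("- "):             # block -> list, list -> item+
            lst = ET.SubElement(doc, "list")
            while i < len(lines) and lines[i].startswith("- "):
                ET.SubElement(lst, "item").text = lines[i][2:]
                i += 1
        else:                                     # block -> para
            ET.SubElement(doc, "para").text = lines[i]
            i += 1
    return doc


source = "Release notes\n\nTwo fixes in this version.\n- parser\n- lexer\n"
print(ET.tostring(up_convert(source), encoding="unicode"))
```

The structure that is only implicit in the source (consecutive "- " lines belong together) becomes explicit markup in the target. An input that does not fit the grammar, here an empty document, is reported as an error rather than converted silently.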
RevXml tracks all changes
between two XML, XHTML or SGML tagged documents. The tool generates revision
markers (indicating addition, deletion and change) in the revised document,
which themselves take the form of additional XML tags. RevXml takes into
account the tree structure of the documents, the attributes and the text
between tags. The basic unit of a text may be a character, a word, a sentence
or the complete text in between two tags. No document schema is required. (See further
the “Management Summary LingDoc RevXml”.)
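The notion of revision markers as additional tags can be illustrated with a much reduced sketch. It compares the text of two elements word by word with `difflib.SequenceMatcher` from Python's standard library and wraps deletions and additions in `del` and `ins` elements. The marker names and the function `mark_revisions` are assumptions for this example; unlike RevXml as described above, the sketch ignores child elements and does not compare attributes.

```python
import difflib
import xml.etree.ElementTree as ET


def mark_revisions(old: ET.Element, new: ET.Element) -> ET.Element:
    """Return a copy of `new` whose text carries <del>/<ins> revision markers."""
    a = (old.text or "").split()
    b = (new.text or "").split()
    out = ET.Element(new.tag, dict(new.attrib))

    def add_text(s: str) -> None:
        # Plain text goes after the last marker element, or at the start.
        if len(out):
            out[-1].tail = (out[-1].tail or "") + s
        else:
            out.text = (out.text or "") + s

    first = True
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        segments = []
        if op == "equal":
            segments.append((None, b[j1:j2]))
        if op in ("delete", "replace"):
            segments.append(("del", a[i1:i2]))
        if op in ("insert", "replace"):
            segments.append(("ins", b[j1:j2]))
        for kind, words in segments:
            if not first:
                add_text(" ")
            first = False
            if kind is None:
                add_text(" ".join(words))
            else:
                ET.SubElement(out, kind).text = " ".join(words)
    return out


old = ET.fromstring("<p>The quick fox jumps</p>")
new = ET.fromstring("<p>The slow fox jumps high</p>")
print(ET.tostring(mark_revisions(old, new), encoding="unicode"))
# <p>The <del>quick</del> <ins>slow</ins> fox jumps <ins>high</ins></p>
```

Here the word is the basic unit of comparison. Choosing characters or sentences instead would only change how the two texts are split before matching.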