Summary LingDoc Transform

What is LingDoc Transform?

Transform is a general-purpose system for the analysis of natural language texts, for the transformation of the structure of documents and for information extraction.

What tasks are performed with Transform?

Transform has been used in a wide variety of projects, e.g. for the

o      recognition of complicated patterns in tagged corpora of texts and music

o      parsing of a number of natural languages

o      transformation of graphemes into phonemes

o      recognition of compound words

o      classification of Latin words

o      conversion of a Russian-Dutch dictionary into a Dutch-Russian one

o      detection of the structure of documents and their subsequent conversion into another format like XML

o      preprocessing of texts

o      construction of scripting languages.

In general, any phenomenon in which sequence is involved and which contains some regularity can be described by LingDoc Transform. The description is done in a grammatical formalism. With Transform it is possible to explore the possible regularities in the phenomenon in order to rigorously describe the detected structures. New samples of the phenomenon can be checked against this description. Relevant portions can be extracted and handed over to external programs. Transformations can be performed.

In this way, Transform is equally suited to natural language tasks and to tasks such as up-conversion to XML, starting from scratch.

What is the difference between Transform and similar systems?

Transform has the streaming property. That means that the input and output may be of indefinite length. In technical terms: the engine of Transform has the on-line property.
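
As a rough illustration of the on-line property (this is not Transform's own engine or notation), the Python sketch below replaces a pattern in a character stream while holding only a bounded buffer, so the input may be endless. The function name stream_replace and the sample rule are assumptions made for this example.

```python
from itertools import cycle, islice
from typing import Iterable, Iterator


def stream_replace(chars: Iterable[str], pattern: str, replacement: str) -> Iterator[str]:
    """Replace `pattern` in a character stream, emitting output as soon as possible.

    At most len(pattern) characters are buffered at any time, so the input
    never has to be read completely before output starts.
    """
    if not pattern:
        raise ValueError("pattern must be non-empty")
    buffer = ""
    for ch in chars:
        buffer += ch
        if len(buffer) == len(pattern):
            if buffer == pattern:
                # Full match: emit the replacement and start afresh.
                yield replacement
                buffer = ""
            else:
                # No match at this position: release the oldest character.
                yield buffer[0]
                buffer = buffer[1:]
    # End of input: what remains is shorter than the pattern and cannot match.
    yield from buffer


if __name__ == "__main__":
    print("".join(stream_replace("photograph", "ph", "f")))
    # The input below never ends; only the first three outputs are requested.
    print(list(islice(stream_replace(cycle("ab"), "ab", "X"), 3)))
```

Because the function is a generator, output is produced while input is still arriving, which is the behaviour the streaming property describes.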

Transform has two modes: one for pattern matching and replacement (PMR) and one for parsing and transduction according to grammars and lexicons (the lingware). The rules for PMR may contain nonterminals, as in grammars. These nonterminals may describe left and right contexts of indefinite size, which determine whether a PMR rule applies. The system has error-handling capabilities.
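
The idea of context-dependent replacement rules can be sketched in Python as follows. This is a toy model under stated assumptions, not Transform's rule syntax: the rule format (left context, pattern, right context, replacement), the name apply_rules and the sample grapheme-to-phoneme rules are all invented for illustration, and ordinary regular expressions stand in for nonterminals.

```python
import re

# Hypothetical toy rules: (left context, pattern, right context, replacement).
# Contexts are regular expressions and may match text of any length.
RULES = [
    ("", "c", "[ei]", "s"),            # "c" before e or i becomes "s"
    ("", "c", "", "k"),                # any other "c" becomes "k"
    ("", "ph", "", "f"),               # "ph" becomes "f"
    ("[aeiou]", "s", "[aeiou]", "z"),  # "s" between vowels becomes "z"
]


def apply_rules(text: str, rules=RULES) -> str:
    """Scan left to right and apply the first rule whose pattern and contexts fit."""
    out = []
    pos = 0
    while pos < len(text):
        for left, pattern, right, replacement in rules:
            match = re.compile(pattern).match(text, pos)
            if match is None or match.end() == pos:
                continue  # pattern does not match here (or matches nothing)
            # The left context must match a stretch ending exactly at `pos`.
            if left and not re.search(rf"(?:{left})\Z", text[:pos]):
                continue
            # The right context must match directly after the pattern.
            if right and not re.compile(right).match(text, match.end()):
                continue
            out.append(replacement)
            pos = match.end()
            break
        else:
            # No rule applies: copy the character unchanged.
            out.append(text[pos])
            pos += 1
    return "".join(out)


if __name__ == "__main__":
    for word in ("cecil", "cactus", "rose", "photo"):
        print(word, "->", apply_rules(word))
```

With these toy rules, "cecil" becomes "sesil" and "cactus" becomes "kaktus": the same pattern "c" is rewritten differently depending on its right context.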

Because of this combination it is possible to describe a text rigorously and, at the same time, convert elements in the text. In that respect Transform differs from tools which conduct simple pattern search and replacement where any intervening material is not checked for errors and inconsistencies.
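
The contrast with plain search and replace can be illustrated with a small Python sketch. It is an assumption-laden example, not Transform itself: the "key: value" line format, the ENTRY pattern and the function convert_strict are hypothetical. Every non-blank line must fit the description; a line that does not fit is reported instead of being passed over silently.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical description of a valid line: a name, a colon and a value.
ENTRY = re.compile(r"(?P<key>[A-Za-z]+):\s*(?P<value>.+)")


def convert_strict(lines):
    """Convert 'key: value' lines to XML elements, rejecting anything else."""
    out = []
    for number, line in enumerate(lines, start=1):
        stripped = line.strip()
        if not stripped:
            continue  # blank lines are allowed between entries
        match = ENTRY.fullmatch(stripped)
        if match is None:
            # Intervening material that does not fit the description is an error.
            raise ValueError(f"line {number}: not a valid entry: {line!r}")
        key, value = match["key"], escape(match["value"])
        out.append(f"<{key}>{value}</{key}>")
    return out


if __name__ == "__main__":
    print(convert_strict(["title: Manual", "", "author: A & B"]))
    try:
        convert_strict(["title: Manual", "???"])
    except ValueError as error:
        print("rejected:", error)
```

A bare substitution over the same input would convert the first line and leave "???" in place without any warning.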

Who is using Transform?

Transform has been used in academic environments for a wide variety of tasks. In industrial environments it has been used for the up-conversion of documents to LaTeX, SGML and XML. It has also been used for the parsing phase of an industrial machine translation (MT) system based upon dependency grammars.

Why is Transform interesting?

Transform is based on unique extensions to the theory of formal grammars and automata. See also the position papers, articles and technical documentation.

Platforms

Transform consists of a compiler and a runtime system. These may run on different platforms because the generated programs are written in pseudo code, which is interpreted by the runtime system. Transform itself is written in Pascal and has been ported to a variety of platforms. Currently, Transform is maintained in Borland’s Developer Studio 2006. The programs can be called from a separate GUI written in Java.
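
The division of labour between a compiler and a runtime system can be sketched in Python. The source does not describe Transform's actual pseudo code, so everything here is an assumption for illustration: the instruction names MATCH and EMIT, the JSON encoding and the functions compile_rules and run.

```python
import json


def compile_rules(rules):
    """'Compiler': turn (pattern, replacement) pairs into portable pseudo code."""
    program = []
    for pattern, replacement in rules:
        if not pattern:
            raise ValueError("pattern must be non-empty")
        program.append(["MATCH", pattern])
        program.append(["EMIT", replacement])
    # Plain text can be stored and interpreted later on another platform.
    return json.dumps(program)


def run(program_text, text):
    """'Runtime': interpret the pseudo code against an input string."""
    program = json.loads(program_text)
    rules = []
    for i in range(0, len(program), 2):
        (op1, pattern), (op2, replacement) = program[i], program[i + 1]
        if (op1, op2) != ("MATCH", "EMIT"):
            raise ValueError(f"malformed program at instruction {i}")
        rules.append((pattern, replacement))
    out, pos = [], 0
    while pos < len(text):
        for pattern, replacement in rules:
            if text.startswith(pattern, pos):
                out.append(replacement)
                pos += len(pattern)
                break
        else:
            # No instruction pair applies: copy the character unchanged.
            out.append(text[pos])
            pos += 1
    return "".join(out)


if __name__ == "__main__":
    code = compile_rules([("ph", "f"), ("qu", "kw")])
    print(code)
    print(run(code, "photograph"))
```

Only the runtime has to exist on the target platform; the compiled program is data and can be moved between machines unchanged.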

History

Transform was developed at the University of Amsterdam, Faculty of Humanities.

Former names of Transform are: Parspat, Atlas and AddXml. The development took eight years, in which a number of grammatical formalisms were unified and original research in Computer Science was performed.

The main developers were Gert van der Steen, Marijke Elstrodt and Pieter Masereeuw. Numerous users contributed to the system through continuous discussions with the developers. A number of doctoral dissertations are based upon the results obtained with Transform. Gert van der Steen wrote a PhD thesis on the system itself.

In the nineties Transform was used in industrial environments, mainly for up-conversion to SGML.

By 2003 the main developers and users had left the university, and the system was transferred to Palstar. There a GUI and an alternative syntax for the grammar formalism (according to the W3C) were developed, together with manuals and a training course that uses the up-conversion of an aerospace manual to XML as its running example.

In 2007 the University of Amsterdam granted Palstar the right to publish Transform as open source.

More information

More information can be found in the position papers.

Gert van der Steen wrote a thesis on the subject of Transform: Van der Steen, G.J., A program generator for recognition, parsing and transduction with syntactic patterns. CWI Tract 55, Centre for Mathematics and Computer Science, P.O. Box 4079, 1009 AB Amsterdam, The Netherlands, 284 pp., 1988, ISBN 90 6196 361 3.