Workpackage 5: Neo-Latin Morphological Analyzer

 

 

Year 2 Progress Reports

June 2003 - May 2004

 

Executive Summary

 

Accomplishments

 

During the second year of work, WP5 continued developing a new version of the Latin morphological analyser LEMLAT, which adds new information about the input word forms. A demo of this new version (named CHLT LEMLAT) and the source code of the program are available on the CHLT website (http://www.chlt.org).

 

In particular, the following results were achieved in developing CHLT LEMLAT:

•    addition of gender codes to the LES belonging to ambiguous morphological categories;

•    coding of SF;

•    modifications to the LES archive, required by problems encountered in adding gender codes and coding SF;

•    design and implementation of a MySQL database for managing the LES archive;

•    implementation of the algorithm for the complete morphological analysis of wordforms with structure LES + SF;

•    implementation of procedures that let the LEMLAT modules use the database;

•    implementation of applications for specific handling of the information contained in the LES archive;

•    building a client version of LEMLAT;

•    building and testing LEMLAT for the Linux platform;

•    reorganising the LEMLAT source code;

•    coding of SM and management of wordforms with structure LES + SM + SF;

•    coding of FE;

•    coding as FE of the adjective wordforms ending in -um, -i, -o, -a that are used as adverbs;

•    coding of the N, V, PR LES;

•    coding of the P1-P9 and P18 LES;

•    management of I LES;

•    coding of the Type of the "no f Type" adjectives;

•    FE management;

•    design of a general rule for the management of LE; in particular, special rules were designed to resolve conflicts between the SF coding and the LE rule and to handle some exceptions;

•    tables for initial graphical variations and post-final segments;

•    identification of the morphological values to be attributed to the LE of COD LES N6*, N7* and Pluralia Tantum;

•    management of N, V, PR, P1-P9 and P18 LES;

•    management of irregular Type;

•    management of present participles, past participles, future participles and irregular gerundives;

•    reorganisation of the output and XML format;

•    writing of four articles;

•    implementation of a library in standard C that can be linked into several applications;

•    implementation of an interactive console application;

•    implementation of a CGI application currently running on the CHLT website at the WP5 page (http://www.chlt.org/~cnr/);

•    implementation of an application for the morphological analysis of a text, producing output in a predefined format suitable for both display and further processing.
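
The analysis of wordforms with structure LES + SF mentioned in the list above can be illustrated with a minimal sketch. It assumes, since this report does not spell the abbreviations out, that LES is the lexical (stem) segment and SF the final (inflectional) segment, and that a reading is valid when the codes of the two segments agree. The tables, the codes N1 and N2, and the function name analyse are hypothetical toy data chosen for illustration; they are not the contents of the LES archive or the LEMLAT implementation, and the toy SF table is deliberately incomplete.

```python
# Toy tables: hypothetical stand-ins for the LES archive and the SF table.
# Each LES maps to a list of (lemma, code) pairs; each SF maps to a list of
# (code, morphological description) pairs. All values are illustrative only.
LES_TABLE = {
    "ros": [("rosa", "N1")],
    "domin": [("dominus", "N2")],
}
SF_TABLE = {
    "a": [("N1", "nominative singular")],
    "ae": [("N1", "genitive singular"), ("N1", "nominative plural")],
    "i": [("N2", "genitive singular"), ("N2", "nominative plural")],
}


def analyse(wordform):
    """Return every (lemma, description) reading of wordform as LES + SF."""
    readings = []
    # Try every split point; both segments must be non-empty.
    for cut in range(1, len(wordform)):
        les, sf = wordform[:cut], wordform[cut:]
        for lemma, les_code in LES_TABLE.get(les, []):
            for sf_code, description in SF_TABLE.get(sf, []):
                # A reading is kept only if the two codes agree.
                if les_code == sf_code:
                    readings.append((lemma, description))
    return readings


print(analyse("rosae"))
```

In this sketch a wordform such as rosae keeps all the readings found in the toy tables, since no disambiguation of homographs is attempted at the morphological level.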

 

 

Next steps to complete by the end of the project

 

•    testing of CHLT LEMLAT lemmatization results: a number of Latin texts (and/or single wordforms) will be submitted to CHLT LEMLAT, and the results will be checked analytically to find possible mistakes;

•    source code testing and validation;

•    source code documentation, to help developers build specific applications;

•    definition and documentation of CHLT LEMLAT results: production of a "TEI compliant" DTD specifically designed with morphological elements and attributes;

•    implementation of a web-based application for the morphological analysis of texts or text fragments.

 

 

Insights for the future (CHLT 2?)

The aim should be the development of a multi-modular tool (that is, a tool built from several distinct modules) that allows the user to query a corpus of Latin texts.

We have focused on Latin, but the same approach could be applied to Old Norse and other languages.

The tool should be able to answer at least the following kinds of query:

•    on a purely morphological level: for instance, the user can retrieve all the wordforms occurring in the texts of Cicero that are inflected as first-declension nouns with genitive singular in -ai. Homographs are not disambiguated;

•    on a morpho-syntactic level: homographs are disambiguated. The user can find out whether and where a particular kind of syntactic structure occurs in the texts of Cicero;

•    on a semantic level: the user searches for the word love (in English!) and obtains all the lemmas whose semantic definition contains love as first, second, third,... meaning, metaphorical use, technical use...;

•    statistics: each lemma is accompanied by its frequency of use in the corpus (broken down by author, period, book, style of the book,...). Each wordform is linked to its morphological frequency (homographs not disambiguated) and its morpho-syntactic frequency in the corpus. Each lemma belongs to a "semantic family" (SF) and a "morphological family" (MF): an SF contains all the lemmas that share a meaning in their definitions; an MF contains all the lemmas that share a stem in the stemming procedure;

•    Greek-Latin relationship through English: every Latin lemma is linked to the corresponding Greek lemma, selected through the common meaning in the dictionary.
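
The statistics and the morphological families described in the list above could be computed along the following lines. This is only a sketch under stated assumptions: the corpus is represented as hypothetical (author, wordform, lemma, stem) tuples that an earlier analysis step is assumed to have produced, and the names TOKENS, lemma_frequencies and morphological_families are invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical analysed corpus: (author, wordform, lemma, stem) tuples.
# The stems are toy values, not the output of a real stemming procedure.
TOKENS = [
    ("Cicero", "amorem", "amor", "am"),
    ("Cicero", "amat", "amo", "am"),
    ("Cicero", "amoris", "amor", "am"),
    ("Caesar", "bellum", "bellum", "bell"),
    ("Caesar", "amat", "amo", "am"),
]


def lemma_frequencies(tokens):
    """Count lemma occurrences separately for each author."""
    freq = defaultdict(Counter)
    for author, _wordform, lemma, _stem in tokens:
        freq[author][lemma] += 1
    return freq


def morphological_families(tokens):
    """Group the lemmas that share a stem into one family per stem."""
    families = defaultdict(set)
    for _author, _wordform, lemma, stem in tokens:
        families[stem].add(lemma)
    return families


freq = lemma_frequencies(TOKENS)
families = morphological_families(TOKENS)
print(freq["Cicero"]["amor"])
print(sorted(families["am"]))
```

A semantic family could be built the same way by grouping lemmas on a shared meaning from the dictionary definition instead of on the stem.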

 

The general structure of the analysis of a text is as follows (the example concerns Latin, but it applies to other languages as well):

1.     Input Latin text (from the CHLT corpus),

2.     Morphological analysis (CHLT LEMLAT),

3.     Morpho-syntactic analysis (Stemming and Syntactic Parser),

4.     Dictionary entry (lemma) with (a) statistical information, (b) structured semantic description (SF and MF) and (c) link to Greek dictionary.
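
The four steps above can be sketched as a chain of modules, each consuming the output of the previous one. Everything in this sketch is an assumption made for illustration: the function names, the Token structure and the two toy tables are hypothetical, the disambiguation step is a placeholder that simply keeps the first reading rather than a real stemmer or syntactic parser, and the Greek link is left empty.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical stand-ins for the output of the real modules.
TOY_ANALYSES = {"amat": [("amo", "verb, 3rd person singular present")]}
TOY_DICTIONARY = {"amo": {"frequency": 2, "meanings": ["love"], "greek": None}}


@dataclass
class Token:
    wordform: str
    readings: list = field(default_factory=list)  # filled by step 2
    reading: Optional[tuple] = None               # filled by step 3
    entry: Optional[dict] = None                  # filled by step 4


def tokenize(text):
    """Step 1: split the input text into lower-cased wordforms."""
    words = (w.strip(".,;:").lower() for w in text.split())
    return [Token(w) for w in words if w]


def morphological_analysis(token):
    """Step 2: attach every possible reading (homographs are kept)."""
    token.readings = TOY_ANALYSES.get(token.wordform, [])
    return token


def disambiguate(token):
    """Step 3 placeholder: keep the first reading; a parser would use context."""
    token.reading = token.readings[0] if token.readings else None
    return token


def dictionary_lookup(token):
    """Step 4: attach the dictionary entry of the chosen lemma, if any."""
    if token.reading is not None:
        token.entry = TOY_DICTIONARY.get(token.reading[0])
    return token


def run_pipeline(text):
    """Run steps 1 to 4 in order on every token of the text."""
    tokens = tokenize(text)
    for stage in (morphological_analysis, disambiguate, dictionary_lookup):
        tokens = [stage(t) for t in tokens]
    return tokens


result = run_pipeline("Puella amat.")
print([(t.wordform, t.entry) for t in result])
```

Keeping each step as a separate function mirrors the multi-modular design: a module can be replaced (for example for another language) without touching the others, as long as it fills the same fields.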

 

A possible list of the Workpackages follows, in no hierarchical order: