Workpackage 5: Neo-Latin Morphological Analyzer
Year 2 Progress Reports
June 2003 – May 2004
Executive Summary
Accomplishments
During the second year of work, WP5 continued developing a new version of the
Latin morphological analyser LEMLAT, which adds new information about the input
word forms. A demo of this new version (named CHLT LEMLAT) and the source code
of the program are available on the CHLT website (http://www.chlt.org).
In particular, the following results were achieved in developing CHLT LEMLAT:
• addition of gender codes to the LES belonging to ambiguous morphological
categories;
• coding of the SF;
• modifications to the LES archive, required by problems that arose when adding
gender codes and coding the SF;
• design and implementation of a MySQL database for managing the LES archive;
• implementation of the algorithm for the complete morphological analysis of
wordforms with the structure LES + SF;
• implementation of procedures that let the LEMLAT modules use the database;
• implementation of applications for specific handling of the information
contained in the LES archive;
• building a client version of LEMLAT;
• building and testing LEMLAT for the Linux platform;
• reorganising the LEMLAT source code;
• coding of the SM and management of wordforms with the structure
LES + SM + SF;
• coding of FE;
• coding as FE of the adjective wordforms ending in -um, -i, -o, -a that are
used as adverbs;
• coding of the N, V, PR LES;
• coding of the P1-P9 and P18 LES;
• management of I LES;
• coding of the Type of the "no f Type" adjectives;
• FE management;
• design of a general rule for the management of LE; in particular, special
rules were designed to resolve conflicts between the SF coding and the LE rule
and to handle some exceptions;
• tables for initial graphical variations and post-final segments;
• identification of the morphological values to be attributed to the LE of COD
LES N6*, N7* and Pluralia Tantum;
• management of N, V, PR, P1-P9 and P18 LES;
• management of irregular Type;
• management of present participles, past participles, future participles and
irregular gerundives;
• reorganisation of the output and XML format;
• writing of four articles;
• implementation of a library in standard C that can be linked into several
applications;
• implementation of an interactive console application;
• implementation of a CGI application currently running on the CHLT website at
the WP5 page (http://www.chlt.org/~cnr/);
• implementation of an application for the morphological analysis of a text,
producing output in a predefined format suitable for both visualisation and
further processing;
• testing of the CHLT LEMLAT lemmatization results: a number of Latin texts
(and/or single wordforms) will be submitted to CHLT LEMLAT, and the results
will be checked analytically in order to find possible mistakes;
• source code testing and validation;
• source code documentation, to help developers build specific applications;
• definition and documentation of the CHLT LEMLAT results: production of a
"TEI compliant" DTD specifically designed with morphological elements and
attributes;
• implementation of a web-based application for the morphological analysis of
texts or text fragments.
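As an illustration of the LES + SF analysis mentioned in the list above, the
following minimal Python sketch splits a wordform into two segments by table
lookup. It is not the CHLT LEMLAT algorithm. It assumes that LES can be read as
a lexical segment and SF as a final segment (ending); the two toy tables, the
class codes "N1" and "N2", and the function name analyse are hypothetical and
exist only for this example.

```python
# Hypothetical toy tables; the real LES archive and SF coding are far richer.
# Each LES maps to (lemma, class code); each SF maps to a list of
# (class code, morphological description) pairs.
LES_TABLE = {
    "ros": ("rosa", "N1"),
    "domin": ("dominus", "N2"),
}
SF_TABLE = {
    "a": [("N1", "nominative singular"), ("N1", "ablative singular")],
    "ae": [
        ("N1", "genitive singular"),
        ("N1", "dative singular"),
        ("N1", "nominative plural"),
    ],
    "i": [("N2", "genitive singular"), ("N2", "nominative plural")],
    "um": [("N2", "accusative singular")],
}


def analyse(wordform):
    """Return every (lemma, description) reading of wordform as LES + SF."""
    results = []
    # Try every split point; both segments must be non-empty in this sketch.
    for cut in range(1, len(wordform)):
        les, sf = wordform[:cut], wordform[cut:]
        if les not in LES_TABLE or sf not in SF_TABLE:
            continue
        lemma, les_code = LES_TABLE[les]
        # Keep only the endings whose class code matches the code of the LES.
        for sf_code, description in SF_TABLE[sf]:
            if sf_code == les_code:
                results.append((lemma, description))
    return results


print(analyse("rosae"))   # three readings, homographs are not disambiguated
print(analyse("domini"))  # two readings
```

All compatible readings are returned, so homographs stay ambiguous at this
stage, as in a purely morphological analysis.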
Insights for the future (CHLT 2?)
The aim should be the development of a multi-modular tool (that is, a tool
combining several different modules) that allows the user to query a corpus of
Latin texts.
We have focused our attention on Latin, but the same can be done for Old Norse
and other languages.
The kinds of query we would like to be able to answer are, at least, the
following:
• on a purely morphological level: for instance, the user can retrieve all the
wordforms of first-declension nouns inflected as genitive singular in -ai that
occur in the texts of Cicero. Homographs are not disambiguated;
• on a morpho-syntactic level: homographs are disambiguated. The user can find
out whether and where a particular kind of syntactic structure occurs in the
texts of Cicero;
• on a semantic level: the user searches for the word love (in English!) and
obtains as an answer all the lemmas whose semantic definition contains love as
first, second, third, ... meaning, metaphorical use, technical use, ...;
• statistics: each lemma is accompanied by its frequency of use in the corpus
(structured per author, age, book, style of the book, ...). Each wordform is
linked to its morphological (no disambiguation of the homographs) and
morpho-syntactic frequency in the corpus. Each lemma is part of a "semantic
family" (SF) and a "morphological family" (MF): an SF contains all the lemmas
having a common meaning in their definitions; an MF contains all the lemmas
having a common stem in the stemming procedure;
• Greek-Latin relationship through English: every Latin lemma is related
(linked) to the corresponding Greek lemma, selected through the common meaning
in the dictionary.
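The semantic query and the two kinds of family described above can be sketched
in a few lines of Python. This is only an illustration under stated
assumptions: the mini-dictionary, its simplified English glosses, the stems and
the function names are hypothetical. In the envisaged tool the stems would come
from the stemming procedure and the meanings from the dictionary.

```python
from collections import defaultdict

# Hypothetical entries: lemma -> (stem, ordered list of English meanings).
DICTIONARY = {
    "amor": ("am", ["love", "desire"]),
    "amo": ("am", ["love", "like"]),
    "caritas": ("car", ["dearness", "love"]),
    "carus": ("car", ["dear", "expensive"]),
}


def morphological_families(dictionary):
    """Group the lemmas that share a stem (one family per stem)."""
    families = defaultdict(list)
    for lemma, (stem, _meanings) in dictionary.items():
        families[stem].append(lemma)
    return dict(families)


def semantic_family(dictionary, meaning):
    """Return (lemma, rank) pairs for lemmas whose definition has meaning.

    rank is 1 when it is the first meaning, 2 for the second, and so on.
    """
    hits = []
    for lemma, (_stem, meanings) in dictionary.items():
        if meaning in meanings:
            hits.append((lemma, meanings.index(meaning) + 1))
    # Lemmas with the meaning in first position come first.
    return sorted(hits, key=lambda hit: (hit[1], hit[0]))


print(morphological_families(DICTIONARY))
print(semantic_family(DICTIONARY, "love"))
```

Keeping the rank of the matching meaning lets a query distinguish lemmas for
which love is the primary sense from those where it is secondary.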
The general structure of the analysis of a text is the following (the example
concerns Latin, but it is suitable for other languages as well):
1. Input Latin text (from the CHLT corpus),
2. Morphological analysis (CHLT LEMLAT),
3. Morpho-syntactic analysis (Stemming and Syntactic Parser),
4. Dictionary entry (lemma) with (a) statistical information, (b) structured
semantic description (SF and MF) and (c) link to Greek dictionary.
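The four steps above can be read as a chain of functions, each consuming the
output of the previous one. The Python sketch below shows only that chaining.
Every name and table in it is hypothetical, the lexicon lookup merely stands in
for CHLT LEMLAT, and the disambiguation step is a placeholder that picks the
first candidate lemma rather than a real syntactic parser. The link to the
Greek dictionary is left out.

```python
from collections import Counter

# Hypothetical toy data: wordform -> candidate lemmas, lemma -> meanings.
LEXICON = {"amor": ["amor", "amo"], "vincit": ["vinco"], "omnia": ["omnis"]}
GLOSSES = {
    "amor": ["love"],
    "amo": ["love", "like"],
    "vinco": ["conquer"],
    "omnis": ["all", "every"],
}


def morphological_analysis(text, lexicon):
    """Step 2: attach every candidate lemma to each token (no disambiguation)."""
    tokens = text.lower().split()
    return [(token, lexicon.get(token, [])) for token in tokens]


def morpho_syntactic_analysis(analysed):
    """Step 3 placeholder: keep the first candidate; unknown tokens are dropped."""
    return [(token, lemmas[0]) for token, lemmas in analysed if lemmas]


def dictionary_entries(disambiguated, glosses):
    """Step 4: build one entry per lemma with its frequency and meanings."""
    counts = Counter(lemma for _token, lemma in disambiguated)
    return {
        lemma: {"frequency": n, "meanings": glosses.get(lemma, [])}
        for lemma, n in counts.items()
    }


def run_pipeline(text, lexicon, glosses):
    """Steps 1 to 4: input text in, dictionary entries out."""
    analysed = morphological_analysis(text, lexicon)
    disambiguated = morpho_syntactic_analysis(analysed)
    return dictionary_entries(disambiguated, glosses)


print(run_pipeline("Amor vincit omnia", LEXICON, GLOSSES))
```

Because each stage only depends on the data passed to it, any module could be
replaced, for instance to handle another language, without touching the others.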
Here is a possible list of the Workpackages, in no hierarchical order: