Workpackage 5: Neo-Latin Morphological Analyzer



Year 2 Progress Reports

June 2003 - May 2004


Executive Summary




During the second year of work, WP5 continued developing a new version of the Latin morphological analyser LEMLAT, which adds new information about the input wordforms. A demo of this new version (named CHLT LEMLAT) and the source code of the program are available on the CHLT website.


In particular, the following results were achieved in developing CHLT LEMLAT:

•    addition of gender codes to the LES belonging to ambiguous morphological categories;

•    coding of the SF;

•    modifications to the LES archive, required by problems that arose while adding gender codes and coding the SF;

•    design and implementation of a MySQL database for managing the LES archive;

•    implementation of the algorithm for the complete morphological analysis of wordforms with the structure LES + SF;

•    implementation of procedures that let the LEMLAT modules use the database;

•    implementation of applications for specific handling of the information contained in the LES archive;

•    building of a client version of LEMLAT;

•    building and testing of LEMLAT for the Linux platform;

•    reorganisation of the LEMLAT source code;

•    coding of the SM and management of wordforms with the structure LES+SM+SF;

•    coding of the FE;

•    coding as FE of the adjective wordforms ending in -um, -i, -o, -a that are used as adverbs;

•    coding of the N, V, PR LES;

•    coding of the P1-P9 and P18 LES;

•    management of the I LES;

•    coding of the Type of the "no f Type" adjectives;

•    FE management;

•    design of a general rule for the management of LE; in particular, special rules were designed to resolve the conflict between the SF coding and the LE rule and to handle some exceptions;

•    tables for initial graphical variations and post-final segments;

•    identification of the morphological values to be attributed to the LE of COD LES N6*, N7* and Pluralia Tantum;

•    management of the N, V, PR, P1-P9 and P18 LES;

•    management of irregular Type;

•    management of present participles, past participles, future participles and irregular gerundives;

•    output reorganisation and XML format;

•    writing of four articles;

•    implementation of a library in standard C that can be linked into several applications;

•    implementation of an interactive console application;

•    implementation of a CGI application currently running on the CHLT website at the WP5 page;

•    implementation of an application for the morphological analysis of a text, producing output in a predefined format suitable both for display and for further processing.



Next steps to complete by the end of the project


•    testing of the CHLT LEMLAT lemmatization results: a number of Latin texts (and/or single wordforms) will be submitted to CHLT LEMLAT, and the results will be checked analytically to detect possible mistakes;

•    source code testing and validation;

•    source code documentation, to help developers build specific applications;

•    definition and documentation of the CHLT LEMLAT results: production of a "TEI compliant" DTD specifically designed with morphological elements and attributes;

•    implementation of a web-based application for the morphological analysis of texts or text fragments.



Insights for the future (CHLT 2?)

The aim should be the development of a multi-modular tool (that is, a tool combining several different modules) that allows the user to query a corpus of Latin texts.

We have focused on Latin, but the same could be done for Old Norse and other languages.

The kinds of query we would like to be able to answer are, at least, the following:

•    on a purely morphological level: for instance, the user can retrieve all the wordforms in the texts of Cicero that are first declension nouns with singular genitive in -ai. Homographs are not disambiguated;

•    on a morpho-syntactic level: homographs are disambiguated. The user can find out whether and where a particular kind of syntactic structure occurs in the texts of Cicero;

•    on a semantic level: the user searches for the word love (in English!) and obtains all the lemmas whose semantic definition contains love as first, second, third, ... meaning, metaphorical use, technical use, ...;

•    statistics: each lemma is accompanied by its frequency of use in the corpus (structured by author, age, book, style of the book, ...). Each wordform is linked to its morphological frequency (with no disambiguation of homographs) and its morpho-syntactic frequency in the corpus. Each lemma is part of a "semantic family" (SF) and a "morphological family" (MF): an SF contains all the lemmas that share a meaning in their definitions; an MF contains all the lemmas that share a stem in the stemming procedure;

•    Greek-Latin relationship through English: each Latin lemma is linked to the corresponding Greek lemma, selected through the meaning they share in the dictionary.


The general structure of the analysis of a text is the following (the example concerns Latin, but applies to other languages as well):

1.     Input Latin text (from the CHLT corpus),

2.     Morphological analysis (CHLT LEMLAT),

3.     Morpho-syntactic analysis (Stemming and Syntactic Parser),

4.     Dictionary entry (lemma) with (a) statistical information, (b) structured semantic description (SF and MF) and (c) link to Greek dictionary.


A possible list of Workpackages, in no hierarchical order, is the following:

•    WP1: development of the current CHLT corpus of Latin texts (we need even more texts);

•    WP2: development of CHLT LEMLAT. We need:

o      a wider lexical basis, covering at least the medieval lexicon and proper names (Onomasticon),

o      for stemming, a reduction in the number of LES, obtained by adding lists of affixes and, with them, rules of morphological derivation. For instance, a set of rules such as the one that creates adjectives in -bilis from verbs (amabilis);

•    WP3: a syntactic parser (to disambiguate homographs);

•    WP4: extraction of statistical information from the CHLT corpus;

•    WP5: structuring of the semantic description of the lemmas in the dictionary, and Greek-Latin linking.


The results of such a multi-modular tool can be applied in a more general framework. In particular, the following fields seem the most suitable:

•    Education: e-learning,

•    Digital libraries: information retrieval from Latin texts in digital format,

•    Research: linguistics, lexicography, grammatical theories, ...


Year 1 Executive Summary



In the context of the CHLT project, the task of Workpackage 5 is to create a Neo-Latin Morphological Analyser. The people involved are Andrea Bozzi (senior researcher), Giuseppe Cappelli (senior technician), Marco Passarotti (young researcher) and Paolo Ruffolo (young researcher). From 1 September 2002 to 31 August 2003, the work done by Marco Passarotti has been funded not by CHLT but by a CNR grant.


This document summarizes what has been described at length in the two Progress Reports produced by Workpackage 5 since the beginning of the project.[1] All the technical terms used here are explained in those two reports and are therefore not described again in these pages.


In the first part of this document, the main achievements of the project up to May 2003 are summarized.


In the second part, the next steps, to be completed by the end of the second year of the project, are described.


In the third part, the dissemination of the project by WP5 (through articles, lectures, papers and posters at congresses) is summarized.


The conclusion discusses the need for LEMLAT in the IST context and its prospects, in view of an evolution of its analysis of Latin.


1.    Main achievements of the project (up to May 2003)


From the beginning of the project to May 2003, the main achievements of Workpackage 5 are the following:


•    after an evaluation period, LEMLAT was chosen as the automatic lemmatization tool to be developed for the CHLT requirements;

•    the analysis of input wordforms performed by LEMLAT was studied in order to find a way to add to the output the new morphological information required in the CHLT context; an algorithm was written for the analysis of wordforms with the structure LES + SF;

•    the CHLT LEMLAT analysis required a number of items of information to be coded on the SF elements. The codes to be added to each SF were chosen according to the recommendations of the international morpho-syntactic coding standard EAGLES. The list of codes used for LEMLAT, and the problems of using them in a purely morphological context, can be found in the first Periodic Progress Report of Workpackage 5 (December 2002).

Choosing EAGLES as the coding standard for CHLT LEMLAT makes the resulting morphological analyser widely applicable: in the context of IST, LEMLAT can be a very useful tool for browsing and searching large Latin corpora, either on the web or in stand-alone tools;

•    the SF (endings) of the nominal, adjectival, participial and verbal inflexions were coded. This coding was tested on the LEMLAT results. The total number of inserted codes is 27,144;

•    we started adding gender codes to the LES belonging to ambiguous morphological categories. So far 10,812 codes have been inserted (out of a total of 20,984);

•    the coding of the FE (exceptional forms) was completed: 46,740 codes were inserted (10 codes for each FE). The FE had to be coded one by one because the previous version of LEMLAT did not segment them, which ruled out the algorithm designed for wordforms segmented as LES + SF;

•    the coding of the LES with COD LES N (indeclinable nouns), I (invariable lemmas), V (verbs of unspecified conjugation), PR, P1, P2, P3, P4, P5, P6, P7, P8, P9, P18 (different kinds of pronouns) was completed: 15,840 codes were inserted;

•    the software handling of all the data for non-segmented wordforms (FE, N, I, V, PR, P1-P9, P18) is under construction;

•    a basic software system was implemented in C and its results, produced on a specific benchmark, were checked. Moreover, a MySQL database for managing the LES archive was designed and implemented;

•    we made some modifications to the LES archive. During the coding work we found that some earlier coding decisions were unsuitable for the new functions (and analysis formalism) of CHLT LEMLAT. We therefore created new COD LES and designed new, morphologically homogeneous groups of LES. The most important and most difficult modification concerns verbal conjugation: see the details in the second Periodic Progress Report (March 2003);

•    we implemented a client version of LEMLAT;

•    we compiled and tested LEMLAT for the Linux platform;

•    we reorganised the LEMLAT code as a C static library plus I/O structures and functions;

•    we added the SF table to the LEMLAT database and developed specific C functions to manage it;

•    we modified the management of SF information in the LEMLAT code and implemented the algorithm for the complete morphological analysis of wordforms with the structure LES + SF;

•    we evaluated and used some tools for user-friendly management of the LES archive and of the LEMLAT database in general;

•    we built a CGI version of LEMLAT and tested it on both a Linux server and a Windows server. The URL where the current results of LEMLAT can be seen is:

The website is still very simple: on the home page, the user types a Latin wordform to be analysed and clicks the "Lemmatize" button to start the analysis. The results show the lemma(s) followed by the sequence of EAGLES codes that express the morphological value(s) of the analysed wordform. Since users cannot be expected to know the meaning of these codes, each one is explained explicitly in a set of boxes that give the corresponding attribute and value.

In the near future, we will add to the home page:

•    a link to the English version, since the current online version of CHLT LEMLAT is in Italian;

•    a link to all the official documents (Reports, Deliverables, ...) produced by Workpackage 5;

•    a link to the list of the EAGLES codes used in CHLT LEMLAT, each one explained;

•    since LEMLAT is an open-source tool, a link to all the documents related to the LEMLAT dictionary and to its source code. While the work on LEMLAT is still in progress, access to these data will be limited to users with a password.

The current online version of CHLT LEMLAT covers the analysis of all the wordforms with the structure LES + SF.

CHLT LEMLAT does not yet cover the analysis of the following items:

Reason for the lack of coverage: non-segmented wordforms have to be analysed with a different algorithm from the one used for wordforms with the structure LES + SF. The linguistic information on each non-segmented wordform has already been coded in tables; what is still missing (and is now under development) is the software handling of this information. The new analysis algorithm has been designed and will be tested in the coming months. Its main steps are the following:

a.     Receive as input a wordform: abaddier

b.     LEMLAT analyses it (that is to say, "lemmatizes" it) with no segmentation: abaddier-

c.     Search for this wordform in the tables of the LES whose COD LES is one of the following: FE, V, N, I, P1-P9, P18

d.     If no item is found, stop the analysis. If an item is found (abaddier is found in the table of the FE), attach to the output the codes related to that item: NcC--nms-- (third declension noun, masculine, singular, nominative)

e.     Keep reading the table where the item was found and check whether other rows bear the same item. If so, attach to the output the codes related to all the rows found. This is necessary because a wordform can be analysed in more than one way: in the tables that record the morphological values of the non-segmented wordforms, there is one row for each value, with the coding of that value. For instance, the FE abaddier is recorded on two rows, because it has two different morphological values (NcC--nms--: third declension noun, masculine, singular, nominative; and NcC--vms--: third declension noun, masculine, singular, vocative)
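Steps a-e can be sketched as follows. This is only an illustrative Python sketch, not the LEMLAT C code: the in-memory table, its row layout and the name analyse_unsegmented are assumptions made for the example, and the codes are written in the ten-position form used elsewhere in this report. Only the two abaddier rows and their values come from the description above.

```python
# Hypothetical stand-in for the tables of non-segmented wordforms.
# Each row is (wordform, COD LES, morphological codes); one row per value.
UNSEGMENTED_ROWS = [
    ("abaddier", "FE", "NcC--nms--"),  # nominative
    ("abaddier", "FE", "NcC--vms--"),  # vocative
]

# COD LES searched in step c: FE, V, N, I, P1-P9, P18.
SEARCHED_COD_LES = {"FE", "V", "N", "I", "P18"} | {f"P{i}" for i in range(1, 10)}


def analyse_unsegmented(wordform):
    """Return the codes of every row recorded for a non-segmented wordform.

    An empty list means that no row was found and the analysis stops (step d).
    """
    return [
        codes
        for form, cod_les, codes in UNSEGMENTED_ROWS
        if form == wordform and cod_les in SEARCHED_COD_LES
    ]


print(analyse_unsegmented("abaddier"))  # one code string per matching row
print(analyse_unsegmented("xyz"))       # no row found
```

Collecting all matching rows in one pass covers steps d and e together: a wordform with several values simply yields several code strings.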

Reason for the lack of coverage: the code in the second position of the EAGLES standard expresses the Type of the PoS. For adjectives, this information must be added manually, in the dictionary, to each LES with PoS A (Adjective). The list of the Type values can be found in the first WP5 Periodic Progress Report (December 2002).

Reason for the lack of coverage: in the previous LEMLAT version, gerunds, gerundives and participles were all coded in the same categories as the adjectives (N6 and N7). We coded the SF of these categories with PoS A (Adjective), Type to be defined, no Flexive Category, no Mood, no Tense, and with Case, Gender and Number.

However, gerunds, gerundives and participles need PoS V (Verb), Type m (Main) and, in addition to Case, Gender and Number, also Flexive Category (depending on that of the lemma), Mood and Tense. To provide this information in the output, we are writing an algorithm whose fundamental steps are the following:

a.     Receive as input a wordform: amatorum

b.     LEMLAT recognises in it a LES, an SM (segmento mediano, "medial segment") and an SF: am-at-orum

c.     LEMLAT creates two lemmas:

       i.     a lemma N6 or N7: amatus N6

       ii.    a lemma V: amo v1

d.     On the output:

       i.     paste the codes of Case, Gender and Number from those of the SF. SF orum N6 gives two values: genitive, plural, masculine; and genitive, plural, neuter

       ii.    write the codes Vm (Verb, Main) in the first two positions

       iii.   write the code of Flexive Category (third position) according to that of the lemma V: v1 means F (verb of the first conjugation)

       iv.    write the codes of Mood and Tense according to the SM appearing in the middle of the segmented wordform: at means k (passive participle), 4 (perfect)

Thus, the wordform amatorum is analysed as: Verb, main, first conjugation, passive participle, perfect, genitive, masculine and neuter, plural
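The way the three sources of information combine for amatorum can be sketched as follows. This is a hedged illustration in Python, not the algorithm under development: the table names, the function analyse_participle and the use of readable attribute values instead of EAGLES code letters are all assumptions made for the example.

```python
# Hypothetical lookup tables; only the entries needed for "amatorum" are shown.
SM_VALUES = {"at": {"mood": "passive participle", "tense": "perfect"}}
SF_VALUES = {
    ("orum", "N6"): [
        {"case": "genitive", "number": "plural", "gender": "masculine"},
        {"case": "genitive", "number": "plural", "gender": "neuter"},
    ]
}
FLEXIVE_CATEGORY = {"v1": "first conjugation"}


def analyse_participle(sm, sf, sf_class, verb_cod_les):
    """Build one verbal analysis for each value delivered by the SF."""
    analyses = []
    for sf_value in SF_VALUES[(sf, sf_class)]:
        analysis = {
            "pos": "verb",
            "type": "main",
            # Flexive Category depends on the COD LES of the verbal lemma.
            "flexive_category": FLEXIVE_CATEGORY[verb_cod_les],
        }
        analysis.update(SM_VALUES[sm])  # Mood and Tense come from the SM
        analysis.update(sf_value)       # Case, Gender, Number come from the SF
        analyses.append(analysis)
    return analyses


for analysis in analyse_participle("at", "orum", "N6", "v1"):
    print(analysis)
```

The sketch returns two analyses, one per gender, which corresponds to the "masculine and neuter" reading given in the text.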

Reason for the lack of coverage: we are testing algorithms similar to the one described above for gerunds, gerundives and participles.



3. Dissemination of the project


3.1 Presentations and lessons about LEMLAT

CHLT LEMLAT has been presented on the following occasions:

Exploratory workshop on Computer texts: documentation, linguistic analysis and interpretation, organized by the Standing Committee for the Humanities of the European Science Foundation (A. Bozzi and A. Raggioli, Strasbourg, 14-15/6/02)


XIV Round Table on Computer-aided Egyptology (A. Bozzi, Pisa, 8-10/7/02)


Seminars on e-philology at the Classical Studies Dept., Faculty of Letters, Lisboa University (A. Bozzi, Lisbon, 29 and 30/7/02)


Seminar on Progettare il digitale. Tecnologie per i beni librari: conservazione e fruizione in una biblioteca digitale (A. Bozzi, Firenze, 1/10/02)


International congress on Francesco Maurolico e le matematiche del Rinascimento: l'edizione critica dei testi scientifici e la sfida delle nuove tecnologie (A. Bozzi, Messina, 16-19/10/02)


Seminar on Gestione e fruizione di immagini digitali per le biblioteche e gli archivi, organized by Centro di Ateneo per le Biblioteche dell'Università degli Studi di Padova (A. Bozzi and A. Raggioli, Padova, 28/10/02)


Seminar at Istituto Nazionale di Studi sul Rinascimento (A. Bozzi and A. Raggioli, Firenze, 4/11/02)


Lecture at the Romanisches Seminar, Freie Universität Berlin (A. Bozzi and M.S. Corradini, Berlin, 27/11/02)


Lecture at the Computational Linguistics course, Università Cattolica del Sacro Cuore, Milano (M. Passarotti, 4/3/03)


International Colloquium on Antiguidade Clássica: Que fazer com este Património?, Universidade de Lisboa (G. Cappelli, M. Passarotti, 8/5/03)


3.2 Publications

Cappelli Giuseppe, Passarotti Marco, LemLat: uno strumento computazionale per l'analisi linguistica del latino. Sviluppo e prospettive, in "Euphrosyne", Vol. XXXI, 2003


3.3 Next publications and presentations

Poster at the XII International Colloquium on Latin Linguistics, Università di Bologna (M. Passarotti, June 2003)


Article in the proceedings of the International Colloquium on Antiguidade Clássica: Que fazer com este Património? (G. Cappelli, M. Passarotti, to be published in 2003)


4. Conclusion


At the end of its development, CHLT LEMLAT will be a very useful tool for analysing and filtering large Latin corpora covering a wide span of the history of the language.

There is an urgent need to manage large corpora in the new information society, where users can access many documents online and therefore need to filter their linguistic content, first of all by lemmatizing them.

At present, no other Latin lemmatizer can handle as many lemmas as CHLT LEMLAT could.

Most importantly, a powerful lemmatizer is a solid basis for a good syntactic disambiguator: a tool that receives a text as input, reads the wordforms in their syntactic context and chooses, among the analyses given by the lemmatizer, the correct one for each wordform. For instance, the lemmatizer analyses the wordform puella in three possible ways (noun, common, first declension, singular, feminine; nominative, vocative or ablative), but in a syntactic context only one of these values is correct: the task of a syntactic disambiguator is to choose it.

Other prospects for CHLT LEMLAT are:

•    a Latin lexical database, in which the lemmas are enriched with statistical information, images and sounds (where possible), translation, etymology, syllable length, ...;

•    building homogeneous groups of lemmas, according to morphological relatedness (morphological families: tema) and semantic affinity (semantic families: semantema);

•    adding an onomasticon and new lemmas to the dictionary.



Workpackage Progress Reports for Year 2



CHLT Project

1 June - 31 August 2003


Workpackage 5: Neo-Latin Morphological Analyser

Istituto di Linguistica Computazionale  C.N.R.


Andrea Bozzi

Giuseppe Cappelli

Marco Passarotti

Paolo Ruffolo




1. Summary of key indicators of project progress

1.1  Overall assessment of the main milestones achieved

1.1.1      Gender coding

1.1.2      SM coding

1.1.3      Adverbial use as FE

1.1.4      N, V, PR LES coding

1.1.5      P1-P9 and P18 LES coding

1.1.6      Management of I LES

1.1.7      Coding of Type

1.1.8      FE management

1.2  Problems encountered and decisions taken

1.2.1      Gender coding

1.2.2      SM coding

1.2.3      Adverbial use as FE

1.3  Correspondence between planned project progress and actual accomplishments


2. Work progress overview

2.1 Specific objectives for the reporting period

2.2 Achievements

2.2.1 List of Deliverables

2.2.2 Progress by Workpackage/task

2.3 Work planned for the next reporting period


3. Project Management

3.1 Contractual Issues

3.2 Co-operation within the consortium

3.3 Participation in workshops and/or conferences, publications


4. Technical annexes

4.1 Coding of N, V, PR LES

4.2 Coding of P1-P9, P18 LES

4.3 Coding of the "no f Type" adjectives

4.4 SM coding

4.5 Adverbial use coding


1. Summary of key indicators of project progress


This report covers the activities carried out by Workpackage 5 from 1 June 2003 to 31 August 2003.


1.1 Overall assessment of the main milestones achieved


1.1.1 Gender coding

In the LES archive of the previous version of LEMLAT, lemmas of different genders could be coded as belonging to the same paradigmatic category, because gender was not part of the morphological information in the output: for example, both masculine nouns (nauta, pirata) and feminine nouns (rosa, absentia) belong to the category n1 (first declension nouns, masculine and feminine gender).

For the CHLT LEMLAT requirements, gender information is needed in the output.

In the period covered by this report, we completed the gender coding: 19,600 gender codes were applied.


1.1.2 SM coding

An SM (segmento mediano, "medial segment") is an element occurring in the middle part of a wordform[2]. Examples of SM are -and- in am-and-us, -ant- in am-ant-em and -ior- in pulchr-ior-em.

In the previous version of LEMLAT the SM archive was as follows:


SM    Left COD LES     Right ending code

ant   v1               n7

ans   v1               blk

and   v1               n6  n21 n2n

ent   v2  v3  v6       n7

ens   v2  v3  v6       blk

end   v2  v3  v6       n6  n21 n2n

und   v3  v6           n6  n21 n2n

ient  v4  v5           n7

iens  v4  v5           blk

iend  v4  v5           n6  n21 n2n

iund  v4  v5           n6  n21 n2n

ior   n6  n7           n7c blk


The structure of the SM archive was the following:


a.     SM;

b.     COD LES compatible on the left with the SM. For instance, the SM ant can attach, on the left side, to LES with COD LES v1 (first conjugation verbs): an example is the wordform am-ant-em, where am is a LES with COD LES v1;

c.     ending inflexion code(s) compatible on the right with the SM. For instance, the SM ant can attach, on the right side, to endings with code n7 (endings of the second class adjectives): in the wordform am-ant-em, em is an ending with code n7.
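The compatibility check implied by this three-field structure can be sketched as follows. This is an assumption-laden Python illustration: the dictionary layout and the name sm_is_compatible are hypothetical, and only three rows of the archive are copied from the table above.

```python
# A few rows of the SM archive: SM -> (left COD LES, right ending codes).
SM_ARCHIVE = {
    "ant": ({"v1"}, {"n7"}),
    "and": ({"v1"}, {"n6", "n21", "n2n"}),
    "ior": ({"n6", "n7"}, {"n7c", "blk"}),
}


def sm_is_compatible(les_cod, sm, ending_code):
    """True when the SM may stand between a LES and an ending with these codes."""
    if sm not in SM_ARCHIVE:
        return False
    left_codes, right_codes = SM_ARCHIVE[sm]
    return les_cod in left_codes and ending_code in right_codes


print(sm_is_compatible("v1", "ant", "n7"))  # am-ant-em: LES v1, ending n7
print(sm_is_compatible("v2", "ant", "n7"))  # ant is not listed for v2 LES
```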

In the period covered by this report, the SM elements were coded, so that input wordforms involving an SM can be analysed.


1.1.3 Adverbial use as FE

In addition to the regular values, some adjective wordforms ending in -um, -i, -o, -a are also used as adverbs.

For instance, the wordform multo (lemma: multus) is a singular, masculine and neuter dative and ablative, but is also used as an adverb.

In the period covered by this report, we found a way to analyse wordforms such as multo also as adverbs.


1.1.4 N, V, PR LES coding

The LES with COD LES N, V and PR are wordforms analysed with no segmentation.

This implies that they cannot be analysed using the coding of the SF and/or SM as the source of the information needed in the output.

As for all wordforms analysed with no segmentation (for instance, the FE), the morphological values have been assigned to each wordform individually.

The coding of the N, V, PR LES is given in annex 1.


1.1.5 P1-P9 and P18 LES coding

The wordforms formed with a LES with COD LES P1, P2, P3, P4, P5, P6, P7, P8, P9, P18 are analysed with segmentation, but the information needed in the output cannot be delivered by the SF. In these cases the SF carries none of the morphological information needed, which is carried instead by the LES itself.

Here we are dealing with SF such as libet (aliqui-libet), piam (quis-piam), cumque (qui-cumque) and dam (qui-dam). Thus, in a wordform like quem(LES)-cumque(SF), the morphological values (masculine, singular, accusative) are delivered not by the SF cumque, but by the LES quem. This implies that these wordforms must be analysed differently from the other wordforms segmented as LES+SF (puell-am), where the morphological values are delivered by the SF (am: feminine, singular, accusative).

To obtain the analysis of these wordforms, we coded the morphological values on each LES with COD LES P1, P2, P3, P4, P5, P6, P7, P8, P9, P18, so that the analysis of a wordform such as quemcumque takes its values from the coding of the LES quem rather than from the SF.

The coding of the P1-P9, P18 LES is given in annex 2.
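The difference between the two kinds of LES+SF wordform can be sketched as follows. This Python fragment is only an illustration: the tables, the function morphological_values and the assignment of quem to COD LES P1 are assumptions made for the example, not data taken from the LES archive.

```python
# Hypothetical tables: for P1-P9 and P18 the values are stored on the LES,
# for ordinary LES+SF wordforms they are stored on the SF.
PRONOUN_LES_VALUES = {"quem": ("masculine", "singular", "accusative")}
SF_VALUES = {"am": ("feminine", "singular", "accusative")}
PRONOUN_COD_LES = {f"P{i}" for i in range(1, 10)} | {"P18"}


def morphological_values(les, cod_les, sf):
    """Take the values from the LES for pronouns, from the SF otherwise."""
    if cod_les in PRONOUN_COD_LES:
        return PRONOUN_LES_VALUES[les]
    return SF_VALUES[sf]


# quem-cumque: COD LES P1 is an assumed value used only for illustration.
print(morphological_values("quem", "P1", "cumque"))
# puell-am: the values come from the SF am.
print(morphological_values("puell", "n1", "am"))
```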


1.1.6 Management of I LES

The wordforms of LES with COD LES I (invariable; 1804 LES) are analysed with no segmentation and automatically receive the codes X--------- (invariable).


•    Input wordform: abs

•    abs- is analysed with no segmentation

•    abs is found in the LES archive with COD LES I

•    abs is lemmatized with lemma a

•    abs receives the morphological codes X---------


1.1.7 Coding of Type

The Type of adjectives whose Type is not f (qualitative) was added manually. Most adjectives are qualitative, which allowed us to apply this value automatically to all adjectives. We then coded manually, in the list of adjectives, those of "no f Type": possessive, numeral, personal, indefinite, ...

The coding of the "no f Type" adjectives is given in annex 3.


1.1.8 FE management

A specific table was designed and implemented to store the information needed to lemmatize the FE. Each FE LES (i.e. each LES containing the value FE in the CODLES field) was linked with one or more entries of the FE table.

We implemented a set of functions that retrieve the necessary information from the FE table and put it in the output data set. We took particular care to avoid outputting redundant information, since some wordforms must receive information both from the 'normal' analysis and from the FE analysis.

Here is a list of the C functions:

The style follows that of the other functions (and data structures) used to interact with the database, so that the code is easy to understand and modify.

We also verified that the handling used for the FE wordforms can be used for other LES as well (e.g. the N, V, PR LES).
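The concern about redundant output can be illustrated with a small sketch. It is written in Python rather than in the C of the actual functions, and the name merge_analyses and the sample values are hypothetical; it only shows one simple way to combine two result lists without repeating an analysis.

```python
def merge_analyses(normal, fe):
    """Concatenate the two result lists, keeping each analysis only once.

    The order of first appearance is preserved, so the 'normal' values
    come first and the FE table only contributes what is new.
    """
    merged = []
    for analysis in normal + fe:
        if analysis not in merged:
            merged.append(analysis)
    return merged


# Hypothetical values: the FE table repeats one value already found
# by the normal analysis and adds a new one.
normal = ["dative singular", "ablative singular"]
fe = ["ablative singular", "adverb"]
print(merge_analyses(normal, fe))
```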


1.2 Problems encountered and decisions taken


1.2.1 Gender coding

To code the gender information, we marked all the nominal LES with gender codes. We did this manually for the nominal LES belonging to "ambiguous" categories (for example, n1), that is to say categories that include LES of more than one gender; by contrast, we could assign the gender codes automatically to the nominal LES belonging to "unambiguous" categories, that is to say categories that include LES of one gender only (for example, n2n: second declension nouns, neuter gender).

In the period covered by this report, the gender coding was completed. We applied 19,600 gender codes, belonging to the following "ambiguous" categories[3]:

•    n1 (first declension nouns, masculine and feminine gender)

5324 LES:

o      38: masculine and feminine

o      4982: feminine

o      304: masculine

•    n1e (first declension exceptional nouns, masculine and feminine gender)

829 LES

o      7: masculine and feminine

o      463: feminine

o      359: masculine

•    n2 (second declension nouns, masculine, neuter and feminine gender)

2481 LES

o      5: masculine and feminine

o      102: feminine

o      2365: masculine

o      9: neuter

•    n2e (second declension exceptional nouns, masculine, feminine and neuter gender)

1358 LES

o      3: masculine, feminine and neuter

o      109: masculine and neuter

o      11: masculine and feminine

o      30: neuter and feminine

o      145: feminine

o      403: masculine

o      657: neuter

•    n3 (third declension nouns with plural genitive -um/-ium, masculine, feminine and neuter gender)

187 LES

o      1: masculine, feminine and neuter

o      5: masculine and neuter

o      3: masculine and feminine

o      98: feminine

o      78: masculine

o      2: neuter

•    n31 (third declension nouns with plural genitive -um, masculine, feminine and neuter gender)

7289 LES

o      1: masculine and neuter

o      62: masculine and feminine

o      4713: feminine

o      2511: masculine

o      2: neuter

•    n32 (third declension nouns with plural genitive -ium, masculine, feminine and neuter gender)

458 LES

o      29: masculine and feminine

o      289: feminine

o      138: masculine

o      2: neuter

•    n3e (third declension exceptional nouns, masculine, feminine and neuter gender)

505 LES

o      4: masculine and neuter

o      19: masculine and feminine

o      3: neuter and feminine

o      272: feminine

o      158: masculine

o      49: neuter

•    n4 (fourth declension nouns, masculine, feminine and neuter gender)

1056 LES

o      1: masculine, feminine and neuter gender

o      2: masculine and neuter

o      7: masculine and feminine

o      32: feminine

o      1008: masculine

o      6: neuter

•    n5 (fifth declension nouns, masculine and feminine gender)

113 LES

o      4: masculine and feminine

o      109: feminine
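As a consistency check, the per-gender counts listed above can be summed per category and overall. The short Python fragment below does this; the counts are copied from the list, while the variable names are only illustrative.

```python
# Per-gender counts for each "ambiguous" category, in the order listed above.
counts = {
    "n1": [38, 4982, 304],
    "n1e": [7, 463, 359],
    "n2": [5, 102, 2365, 9],
    "n2e": [3, 109, 11, 30, 145, 403, 657],
    "n3": [1, 5, 3, 98, 78, 2],
    "n31": [1, 62, 4713, 2511, 2],
    "n32": [29, 289, 138, 2],
    "n3e": [4, 19, 3, 272, 158, 49],
    "n4": [1, 2, 7, 32, 1008, 6],
    "n5": [4, 109],
}

# Number of LES stated for each category in the report.
stated = {
    "n1": 5324, "n1e": 829, "n2": 2481, "n2e": 1358, "n3": 187,
    "n31": 7289, "n32": 458, "n3e": 505, "n4": 1056, "n5": 113,
}

totals = {category: sum(values) for category, values in counts.items()}
print(totals == stated)      # every category adds up to its stated total
print(sum(totals.values()))  # 19600, the number of gender codes applied
```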


The "unambiguous" categories to which the gender code was assigned automatically are the following:

•    n2i: second declension nouns, masculine gender,

•    n2n: second declension nouns, neuter gender,

•    n2ni: second declension nouns ending in -ium, neuter gender,

•    n3n: third declension nouns with plural genitive in -um/-ium, neuter gender,

•    n3n1: third declension nouns with plural genitive in -um, neuter gender,

•    n3n2: third declension nouns with plural genitive in -ium, neuter gender.


The gender codes we established so far are the following:

•    m: masculine

•    f: feminine

•    n: neuter

•    1: masculine and neuter

•    2: masculine and feminine

•    3: neuter and feminine

•    *: none[4]
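The split between automatic and manual gender coding can be sketched as follows. This is a hypothetical Python illustration: the mapping is built from the list of "unambiguous" categories above, while the function gender_code and its signature are assumptions and do not describe the procedure actually used.

```python
# Gender code shared by every LES of an "unambiguous" category.
UNAMBIGUOUS_GENDER = {
    "n2i": "m",
    "n2n": "n",
    "n2ni": "n",
    "n3n": "n",
    "n3n1": "n",
    "n3n2": "n",
}


def gender_code(cod_les, manual_code=None):
    """Return the gender code of a LES.

    Unambiguous categories get their code automatically; for any other
    category the manually assigned code must be supplied.
    """
    if cod_les in UNAMBIGUOUS_GENDER:
        return UNAMBIGUOUS_GENDER[cod_les]
    if manual_code is None:
        raise ValueError(f"{cod_les} is ambiguous: a manual gender code is required")
    return manual_code


print(gender_code("n2n"))      # assigned automatically
print(gender_code("n1", "f"))  # coded manually, e.g. for rosa
```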


1.2.2 SM coding

Input wordforms segmented by LEMLAT with the structure LES+SM+SF (am-and-us) are analysed by CHLT LEMLAT by combining the information carried by the SM with that carried by the SF occurring in the input: in the output, some information comes from the coding of the SM and some from the coding of the SF.

For instance, in the wordform amandus, segmented am(LES)-and(SM)-us(SF), the SM -and- carries the information about PoS (verb), Type (main), Flexive Category (first) and Mood (gerundive), while the SF -us carries the information about Case (nominative), Gender (masculine) and Number (singular). The sum of this information is the resulting analysis of amandus.

To mark on each SM which positions have to be filled with information coming from the SF, the code = is used. It means that, in the final analysis of the input wordform, the code appearing in this position comes from the coding of the SF occurring in that wordform.

For instance, the steps done by CHLT LEMLAT for the analysis of the wordform amandus are the following:

-Input: amandus

-Lemma: amo

-Segmentation: am(LES)-and(SM)-us(SF)

-SF us n6 codes: Af---nms-1

-SM and v1/n6[5] codes: Vmfr-===--

-Resulting codes: Vmfr-nms--

-Codes conversion: Verb, Main, I Conjug., Gerundive, Nomin., Masc., Sing.
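
The combination step above can be sketched in a few lines of Python. This is a minimal illustration, not the CHLT LEMLAT implementation: it assumes that the SM and SF code strings have the same length and are aligned position by position, and the function name merge_sm_sf is hypothetical.

```python
def merge_sm_sf(sm_codes, sf_codes, placeholder="="):
    """Fill every placeholder position of the SM codes with the SF code
    found at the same position; all other positions keep the SM code.

    Assumption: both strings are positionally aligned and equally long.
    """
    if len(sm_codes) != len(sf_codes):
        raise ValueError("SM and SF code strings must have the same length")
    return "".join(
        sf if sm == placeholder else sm
        for sm, sf in zip(sm_codes, sf_codes)
    )


# Codes taken from the amandus example: SM -and- and SF -us.
result = merge_sm_sf("Vmfr-===--", "Af---nms-1")
print(result)  # Vmfr-nms--
```

Only the three positions marked = (Case, Gender, Number) are taken from the SF; the remaining SF codes are ignored, which is why the final 1 of the SF coding does not appear in the result.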

The SM coding file is given in technical annexe 4.


1.2.3 Adverbial use as FE

In order to analyse wordforms such as multo also as adverbs, we could not code this value on the SF: if the value "adverb" had been coded on the SF -o, it would have appeared in the output analysis of every input first class adjective ending in -o. For instance, the adjective pulchro would also have been analysed as an adverb.

Thus, following the dictionaries, we coded as FE (exceptional wordforms) the adjective wordforms ending in -um, -i, -o, -a that are used as adverbs.

In this way, when it receives a wordform such as multo in input, LEMLAT outputs both the regular values coming from the SF coding and the adverb value coming from the coding of multo as FE.

In total, 166 wordforms are involved: they are listed in technical annexe 5.
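
The effect of this choice can be illustrated with a small Python sketch. It is a simplification under stated assumptions: the analyses are represented by plain placeholder labels instead of the real LEMLAT codes, the regular SF-based analyses are passed in by the caller, and the names ADVERBIAL_FE and analyse are hypothetical.

```python
# Hypothetical FE table of adjective wordforms that are also used as adverbs.
# Only "multo" is taken from the text; the full list is in technical annexe 5.
ADVERBIAL_FE = {"multo"}


def analyse(wordform, regular_analyses):
    """Return the regular SF-based analyses, plus the adverb value
    when the wordform is listed in the FE table."""
    analyses = list(regular_analyses)
    if wordform in ADVERBIAL_FE:
        analyses.append("adverb")
    return analyses


print(analyse("multo", ["adjective"]))    # ['adjective', 'adverb']
print(analyse("pulchro", ["adjective"]))  # ['adjective']
```

Because the adverb value is attached to individual wordforms and not to the SF -o, pulchro keeps only its regular analyses.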



1.3 Correspondence between planned project progress and actual accomplishments


The progress made in Workpackage 5 in the period from 1st June 2003 to 31st August 2003 is in line with what was planned in the Project Program.

In particular, the results are the following:

¯    adding of the gender codes to the LES belonging to ambiguous morphological categories (finished);

¯    coding of the SM and management of the wordforms with structure LES+SM+SF (finished);

¯    coding as FE of the adjective wordforms ending in -um, -i, -o, -a that are used as adverbs (finished);

¯    coding of the N, V, PR LES (finished);

¯    coding of the P1-P9 and P18 LES (finished);

¯    management of I LES;

¯    coding of the Type of the "no f Type" adjectives (finished);

¯    FE management.

2. Work progress overview


2.1 Specific objectives for the reporting period

During the period covered by this report, we continued the development of LEMLAT into CHLT LEMLAT, along two lines:

  1. adding the LES information that is missing in LEMLAT:
    1. gender coding
    2. "no f Type" adjectives coding
  2. finding and developing a way to analyse wordforms with a structure other than LES + SF:
    1. SM coding and management of the wordforms with structure LES + SM + SF
    2. coding and management of non-segmented wordforms:

      i. FE management
      ii. N, V, PR LES coding
      iii. P1-P9 and P18 LES coding
      iv. I LES management

In addition, we decided to handle the analysis of the adjective wordforms ending in -um, -i, -o, -a that are used as adverbs by coding them as FE.



2.2 Achievements


2.2.1 List of Deliverables

December, 2002: Periodic Progress Report

March, 2003: Periodic Progress Report

June, 2003: D 5.1


2.2.2 Progress by Workpackage/task

In line with the specific objectives set, the work in Workpackage 5 covered by this report has produced the following results:

¯    adding of the gender codes to the LES belonging to ambiguous morphological categories (finished);

¯    coding of the SM and management of the wordforms with structure LES+SM+SF (finished);

¯    coding as FE of the adjective wordforms ending in -um, -i, -o, -a that are used as adverbs (finished);

¯    coding of the N, V, PR LES (finished);

¯    coding of the P1-P9 and P18 LES (finished);

¯    management of I LES;

¯    coding of the Type of the "no f Type" adjectives (finished);

¯    FE management;

¯    preliminary study for the management of N, V, PR, P1-P9 and P18 LES.



2.3 Work planned for the next reporting period

The work planned for the next reporting period is the following:

¯    implementation of new LE rules;

¯    testing the lemmatization results for the wordforms with structure LES + SF;

¯    testing the lemmatization results for the FE;

¯    management of N, V, PR, P1-P9 and P18 LES;

¯    management of the lemmatization of the wordforms with structure LES + SM + SF.


3. Participation in workshops and/or conferences, publications


Poster at the XII International Colloquium on Latin Linguistics, Università di Bologna (M. Passarotti, June 2003). The publication of an article in the Conference Proceedings is forthcoming.


Technical annexes

