Workpackage 5: Neo-Latin Morphological Analyser

Andrea Bozzi, Marco Passorotti, Paolo Ruffullo

ILC Pisa

 

Year 3 Executive Summary

May, 2005

 

 

 

In Year 3 we completed our CHLT work on the Neo-Latin Lemmatizer, focusing on five areas: (i) management of non-segmented word-forms, (ii) writing DTDs for CHLT-LEMLAT, (iii) creating a reference manual for CHLT-LEMLAT, (iv) integrating LEMLAT into the CHLT-Perseus Digital Library System, and (v) planning future work with CHLT-LEMLAT.

 

The following modifications were made to the Lemmatiser in Year 3:

a.     adding gender codes to the LES belonging to ambiguous morphological categories

b.     implementing new algorithms for the management of non-segmented wordforms

c.     implementing new algorithms to analyse wordforms with structure LES + SM + SF

d.     coding the Type of each adjectival LES

e.     testing the lemmatisation results for wordforms with structure LES + SF

f.      continuing to modify the source code to make it clearer and easier to modify

g.     documenting the implemented functions, data structures and algorithms

h.     developing an automatic morpho-syntactic disambiguator for semi-automatic morpho-syntactic lemmatisation

i.      adding an Onomasticon to the LEMLAT lexical basis

j.      structuring the LEMLAT lexical basis according to Word Formation Rules

k.     developing a user-friendly lexicographic workstation for LEMLAT disambiguation

l.      creating a Latin Lexical Database, in which each LEMLAT lexical entry is related to its dictionary entry

 

Our CHLT work has transformed the way scholars can work with Latin texts in the following ways:

 

(i)        Managing Latin texts in electronic form with automatic morphological lemmatisation

(ii)       Ability to add new information to the LEMLAT lexical basis (adding lemmas)

(iii)      Ability to modify the LEMLAT source code for personal purposes

(iv)      Ability to modify the LEMLAT morphological codes for personal purposes

(v)       Greater integration between Cultural Heritage documentation in Latin texts and ICT tools and applications

(vi)      Implementation of open source versions of software that was previously available only under licence

(vii)     Greater collaboration between centres of excellence in the US and Europe in the study of ancient texts and the development of ICT tools for digital scholarship.

 

Conclusions of CHLT WP4

 

CHLT-LEMLAT is a useful tool for analysing and filtering large Latin corpora that cover a wide period in the history of the language. It fills an urgent need for ways of managing large corpora of this kind in a digital environment where users can access a multitude of documents on-line but have no way of filtering their linguistic content. CHLT-LEMLAT offers for the first time a way of lemmatising Latin corpora for the purposes of sophisticated linguistic analysis, and is at the moment the most powerful tool available anywhere in the world for the Latin language. Most importantly, CHLT-LEMLAT provides a powerful lemmatizer that can serve as the basis for a tool for syntactic disambiguation: such a tool receives the text as input, reads the word-forms in the syntax parser and chooses the correct analysis of each word-form from those offered by the lemmatizer. For instance, the lemmatizer analyses the word-form puella in three possible ways (common noun, first declension, feminine, singular, and nominative, vocative or ablative); in a given syntactic context, however, only one of these values is correct. The task of a syntactic disambiguator is to choose the correct one.
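The puella example can be made concrete with a minimal sketch. The Analysis class and the disambiguate function are hypothetical names introduced only for illustration; the filter by expected case stands in for a real syntactic parser, which is not described in this report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Analysis:
    """One candidate reading of a word-form (hypothetical structure)."""
    lemma: str
    pos: str
    declension: str
    number: str
    gender: str
    case: str


# The three readings of "puella" listed in the text: same lemma, different case.
PUELLA = [
    Analysis("puella", "noun", "first", "singular", "feminine", case)
    for case in ("nominative", "vocative", "ablative")
]


def disambiguate(candidates, expected_case):
    """Keep only the readings whose case matches what the context requires.

    A real disambiguator would derive expected_case from the syntax parser;
    here it is simply passed in as an assumption.
    """
    return [a for a in candidates if a.case == expected_case]


chosen = disambiguate(PUELLA, "nominative")
print(len(PUELLA), "candidates, chosen case:", chosen[0].case)
```

The lemmatizer's job ends with the list of candidates; choosing among them requires syntactic context, which is why the two components are kept separate.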

 

Future Work: Dissemination and Exploitation of Results

 

The aim should be the development of a multi-modular tool that allows the user to query a corpus of Latin texts, with the thought that it will stand as a paradigm for future work in other languages.

 

The kinds of query we would like to be able to answer are the following:

 

    Morphological Queries: on a purely morphological level. For instance, the user can retrieve all the wordforms inflected as first-declension nouns with genitive singular in -ai that occur in the texts of Cicero. The homographs are not disambiguated;

 

    Morpho-syntactic Queries: the homographs are disambiguated. The user can find out whether and where a particular kind of syntactic structure occurs in the texts of Cicero;

 

    Semantic Queries: on a semantic level, the user searches for the word love (in English) and obtains all the lemmas whose semantic definition contains love as first, second, third, ... meaning, metaphorical use, technical use, ...;

 

    Statistical Levels: each lemma is accompanied by its frequency of use in the corpus (structured per author, age, book, style of the book, ...). Each wordform is bound to its morphological frequency (no disambiguation of the homographs) and its morpho-syntactic frequency in the corpus. Each lemma is part of a "semantic family" (SF) and a "morphological family" (MF): an SF contains all the lemmas that share a common meaning in their definition; an MF contains all the lemmas that share a common stem in the stemming procedure.

 

    Multilingual Queries: Greek-Latin relationship through English: each Latin lemma is linked to the corresponding Greek lemma, selected through the common meaning in the dictionary.

 

The general structure of the analysis of a text is the following (the example is in Latin, but is suitable for other languages):

 

1.     Input Latin text (from the CHLT corpus),

2.     Morphological analysis (CHLT LEMLAT),

3.     Morpho-syntactic analysis (Stemming and Syntactic Parser),

4.     Dictionary entry (lemma) with (a) statistical information, (b) structured semantic description (SF and MF) and (c) link to Greek dictionary.

 

The division of possible Workpackages:

 

    WP1: development of the actual CHLT corpus of Latin texts (we need even more texts);

    WP2: development of CHLT LEMLAT. We need:

o      a wider lexical basis, in order to cover at least the medieval lexical extension and the proper names (Onomasticon),

o      for the stemming, a smaller number of LES, obtained by adding lists of affixes and, thus, rules of morphological derivation (for instance, a set of rules such as the one that creates adjectives in -bilis from verbs: amabilis);

    WP3: a syntactic parser (to disambiguate the homographs);

    WP4: extraction of statistical information from the CHLT corpus;

    WP5: structuring the semantic description of the lemmas in the dictionary and Greek-Latin linking.
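The derivation rule mentioned under WP2 (adjectives in -bilis from verbs) could be represented along the following lines. This is only a sketch: the DerivationRule class and its fields are hypothetical, and the rule simply appends the suffix to a verbal stem supplied by the caller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DerivationRule:
    """A word formation rule: source PoS + suffix -> target PoS (hypothetical)."""
    source_pos: str
    target_pos: str
    suffix: str

    def apply(self, stem: str) -> str:
        # Append the derivational suffix to the stem of the base word.
        return stem + self.suffix


# The rule cited in the text: verb -> adjective in -bilis.
BILIS = DerivationRule(source_pos="verb", target_pos="adjective", suffix="bilis")

print(BILIS.apply("ama"))  # stem of amare, giving the example from the text
```

Storing one stem plus a list of such rules, rather than one LES per derived word, is the reduction in the number of LES that the workpackage aims at.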

 

The results of such a multi-modular tool can be applied in a more general framework and be extended to the following areas:

 

    Education: e-learning,

    Digital libraries: information retrieval from Latin texts in digital format,

    Research: linguistics, lexicography, grammatical theories.


 

 

CHLT Deliverable 5.3: Documentation for Lemmatisation Module for Early Modern Latin (Month 30)

 

 

Reference Manual for CHLT-LEMLAT

 

 

LEMLAT

Wordforms analysis

Database description

 

 

 

Key to Codes

 

 

o      LES: the invariable part of the inflected forms;

 

o      SM (Segmento Mediano): the middle part of the inflected forms;

 

o      SF (Segmento Finale): the final part of the inflected forms;

 

o      SI (Segmento Iniziale): the initial part of the inflected forms;

 

o      SPF (Segmento Post Finale): a segment added on the right side of the final part of a wordform;

 

o      COD LES: it is the code assigned to each LES; each COD LES refers to a particular type of inflexion;

 

o      COD LEM: it is the code assigned to each output lemma; each COD LEM refers to a general type of inflexion;

 

o      FE (Forma Eccezionale): exceptional wordform. A wordform inflected in an exceptional way that cannot be regularly segmented and recognised;

 

o      LE (Lemma Eccezionale): exceptional lemma. A lemma created in an exceptional way that cannot be automatically created;

 

o      CLEM (Costellazione LEMmatica): contains all the LES related to a common lemma, or common dictionary entry; it is referred to through a unique N_ID

 

o      Ipolemma: intermediate lemma produced in output, not referring to a dictionary entry;

 

o      Iperlemma: lemma produced in output referring to a dictionary entry

 

o      N_ID: alphanumeric code applied to all the LES. More LES can share the same N_ID: all the LES related to a common lemma, or common dictionary entry are registered with the same N_ID (forming a CLEM)

 

o      CodLE: numeric code of LE, related to pattern(s) of 7 EAGLES codes bringing morphological information about the wordforms

 

o      EAGLES (Expert Advisory Group on Language Engineering Standards): standard coding of morphological, morpho-syntactic and semantic information of the words. In LEMLAT, 3 EAGLES codes are related to lemmas, and 7 to wordforms

 

 

 

 

 

 

 

Analysis of Wordforms

 

 

Given a wordform as input, LEMLAT produces the following output, provided the wordform can be analysed:

 

-       the corresponding lemma(s);

-       a code expressing the inflexional paradigm of the lemma(s) (codlem);

-       the n_id of the CLEM of the lemma(s) (see table lessario);

-       3 EAGLES codes (converted from codlem) related to the lemma (one pattern of 3 EAGLES codes for each lemma produced in output), with information about (see cod_morf table):

o      P(osition)1: PoS

o      P2: Type (different possible types of each PoS; for instance, a noun can have Type common, or proper)

o      P3: Flexional Category (declension, conjugation, ...)

-       pattern(s) of 7 EAGLES codes related to the wordform, with information about (see cod_morf table):

o      P4: mood

o      P5: tense

o      P6: case

o      P7: gender

o      P8: number

o      P9: person

o      P10: degree
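As an illustration, the ten positions can be modelled as a fixed-length string split into lemma codes (P1-P3) and wordform codes (P4-P10). The sketch assumes one character per position and "-" for an unspecified value; the field names and the sample pattern are hypothetical and are not actual LEMLAT output.

```python
# Names for the ten positions, in the order given in the list above.
POSITIONS = (
    "pos", "type", "flexional_category",      # P1-P3: lemma codes
    "mood", "tense", "case", "gender",        # P4-P7
    "number", "person", "degree",             # P8-P10
)


def split_pattern(pattern: str):
    """Split a 10-character pattern into lemma codes, wordform codes and named fields."""
    if len(pattern) != len(POSITIONS):
        raise ValueError("expected a pattern of exactly 10 codes")
    fields = dict(zip(POSITIONS, pattern))
    return pattern[:3], pattern[3:], fields


# Hypothetical pattern: three lemma codes followed by seven unspecified codes.
lemma_codes, wordform_codes, fields = split_pattern("xyz-------")
print(lemma_codes, wordform_codes, fields["case"])
```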

 

 

This analysis is obtained through a process of segmentation/recognition of input wordforms.

 

For each input wordform, LEMLAT makes a number of segmentation attempts.

When one of these attempts is consistent with the LEMLAT data about the possible segments of wordforms, the analysis of the wordform is produced in output.

 

There are three possible segmentation structures:

1.       LES + SF

2.       LES + SM + SF

3.       LES + SM + SM + SF

Each of these structures can be preceded by an SI and followed by an SPF.
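The three structures can be enumerated with a small recursive search, sketched below. The function name and the segment inventories are toy assumptions, not LEMLAT data; SI and SPF handling and the compatibility checks described later are left out.

```python
def segmentations(wordform, les_set, sm_set, sf_set, max_sm=2):
    """Return every split of wordform as LES + (0 to max_sm) SM + SF."""
    results = []

    def extend(rest, parts, sm_used):
        # Option 1: the remainder is a final segment, so the structure is complete.
        if rest in sf_set:
            results.append(tuple(parts + [rest]))
        # Option 2: consume one more middle segment, if still allowed.
        if sm_used < max_sm:
            for i in range(1, len(rest)):
                if rest[:i] in sm_set:
                    extend(rest[i:], parts + [rest[:i]], sm_used + 1)

    # Try every non-empty proper prefix as the invariable part (LES).
    for i in range(1, len(wordform)):
        if wordform[:i] in les_set:
            extend(wordform[i:], [wordform[:i]], 0)
    return results


# Toy inventories, chosen only to exercise structures 1 and 3.
LES = {"puell", "am"}
SM = {"a", "ba"}
SF = {"a", "t"}

print(segmentations("puella", LES, SM, SF))  # LES + SF
print(segmentations("amabat", LES, SM, SF))  # LES + SM + SM + SF
```

The search does not stop at the first hit: all splits consistent with the inventories are collected, which mirrors the way a wordform may receive more than one segmentation.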

 

In addition to the segmentation process, a wordform can also be recognised (and thus analysed) without segmentation in the following cases:

-        Input wordform is a FE

-        Input wordform is a LE

-        Input wordform is a les with codles i (invariables)

-        Input wordform is a les with codles n (uninflected nouns)

-        Input wordform is a les with codles v (verbs not related to a specific conjugation)

-        Input wordform is a les with codles pr, or p1-p9, or p18 (not segmented pronominals)

Each of these forms, too, can be preceded by an SI and followed by an SPF.
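The codles values listed above can be collected in a simple lookup set. This is a sketch only: the names are hypothetical, and FE and LE, which are recognised through their own tables, are not covered.

```python
# Codles values that the list above gives as recognisable without segmentation.
# The range "p1-p9" is expanded to p1, p2, ..., p9.
UNSEGMENTED_CODLES = {"i", "n", "v", "pr", "p18"} | {f"p{k}" for k in range(1, 10)}


def recognised_without_segmentation(codles: str) -> bool:
    """True if a les with this codles can be matched as a whole wordform."""
    return codles in UNSEGMENTED_CODLES


print(recognised_without_segmentation("p7"), recognised_without_segmentation("p10"))
```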

 

A segmentation is valid if its segments are mutually compatible (on the left and/or right side). The compatibility of the segments is coded along with the segments themselves (see the lessario, tabsf, tabsm, tabsi and tabspf tables).

For instance, a structure such as

LES + SM + SF

is found valid if:

-        the left compatibility of SM corresponds to codles (that is, to the right compatibility of LES)

-        the right compatibility of SM corresponds to the left compatibility of SF
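The two conditions can be written as a short check. The sketch assumes that "corresponds" means equality of single codes; the real tables may record sets of compatible codes, and the class names and code values below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Les:
    form: str
    codles: str          # also acts as the right compatibility of the LES


@dataclass(frozen=True)
class Segment:
    form: str
    left: str            # left compatibility code
    right: str = ""      # right compatibility code (not needed for an SF)


def is_valid_les_sm_sf(les: Les, sm: Segment, sf: Segment) -> bool:
    """Check the two conditions for a LES + SM + SF structure."""
    sm_fits_les = sm.left == les.codles   # SM attaches to this kind of LES
    sf_fits_sm = sm.right == sf.left      # SF attaches to this SM
    return sm_fits_les and sf_fits_sm


# Hypothetical codes, for illustration only.
les = Les("am", codles="v1")
sm = Segment("ba", left="v1", right="x")
print(is_valid_les_sm_sf(les, sm, Segment("t", left="x")))   # compatible chain
print(is_valid_les_sm_sf(les, sm, Segment("t", left="y")))   # SF does not fit SM
```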

 

The output information is produced as follows:

A)

If the input wordform is segmented:

-        lemma and codlem (3 EAGLES lemma codes): produced according to codles (see eagles table and annex 2)

-        pattern(s) of 7 EAGLES wordform codes: from SF (and SM) coding (see tabsf and tabsm tables)

 

B)

If the input wordform is not segmented:

-        in case of LE:

o      codlem (3 EAGLES lemma codes): according to codles (see eagles table and annex 2)

o      pattern(s) of 7 EAGLES wordform codes: from codLE (each LE is related to a codLE, which carries the pattern(s) of seven EAGLES codes of the wordform; see cod_le and tab_le tables)

o      lemma: the LE itself (possibly reduced to an iperlemma)

-        in case of les with codles i:

o      patterns of 10 EAGLES codes (3 lemma codes + 7 wordform codes): positions 1-3 converted from codlem (see eagles table); positions 4-10 automatically assigned as -------

o      lemma: produced according to codles (see annex 2), or to the information related to the les concerned in table lessario

-        in case of les with codles FE, n, v, pr, p1-p9, or p18:

o      pattern(s) of 10 EAGLES codes (3 lemma codes + 7 wordform codes): from the hard-coding of each les with codles FE, n, v, pr, p1-p9, or p18 (see forme_ecc table)

o      lemma: produced according to codles (see annex 2), or to the information related to the les concerned in table lessario
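For the invariables (codles i), the rule above amounts to concatenating the three lemma codes with seven dashes. A minimal sketch, assuming the eagles table is available as a list of rows keyed by column name; the codlem value and the codes in the sample row are hypothetical.

```python
# Seven unspecified wordform codes, as given in the text.
UNSPECIFIED_WORDFORM_CODES = "-" * 7


def invariable_pattern(codlem, eagles_rows):
    """Build the 10-code pattern for a les with codles i."""
    for row in eagles_rows:
        if row["codlem"] == codlem:
            lemma_codes = row["c01"] + row["c02"] + row["c03"]
            return lemma_codes + UNSPECIFIED_WORDFORM_CODES
    raise KeyError(codlem)


# One hypothetical line of the eagles table.
eagles_rows = [{"codles": "i", "codlem": "I", "c01": "X", "c02": "-", "c03": "-"}]

print(invariable_pattern("I", eagles_rows))
```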

 

Each segmentation can produce analyses related to more than one lemma.

When a segmentation is found valid and the analysis is performed, LEMLAT does not stop, but makes further segmentation/recognition attempts: a wordform can be segmented (and analysed) in more than one way. Equally, the same wordform can be analysed both through segmentation and without segmentation (for instance, when a regularly segmented wordform is homographic with a non-segmented FE).

 

The analysis of a wordform performed by LEMLAT can be summarised according to the following schema:


 


Database Tables

 

o      lessario

o      cod_le

o      cod_morf

o      eagles

o      forme_ecc

o      tab_le

o      tabsf

o      tabsm

o      tabspf

o      tabsai

o      tabsi

 

lessario

 

List of the les.

 

-        n_id

o      clem identification number

o      values:

       letter (first letter of the lemma)

       four digits

-        gen

o      gender

o      values: see cod_morf table, field field_pos, value 7

-        clem

o      in a clem containing more than 1 les, identifies the les through which the lemma has to be created

o      values:

       v: identifies the les through which the lemma has to be created

       i: for superlative and comparative forms of irregular participle and irregular gerundive, the second lemma created (participle, or gerundive at positive degree) is an ipo- and not an iperlemma

       k: stops the creation of the iperlemma (value v is inhibited)

-        si (Segmento Iniziale)

o      initial alteration h

o      value:

       h: the les appears also with an initial h

-        smv (Segmento Mediano Verbale)

o      automatic insertion/exclusion of smv

o      values:

       +: adds a smv to the les, to automatically create the regular basis for perfect and future participle, and perfectum

       : adds a smv to the les, to automatically create the regular basis for comparative, superlative, present participle, gerund and gerundive

       blank: no smv to be added (irregular inflections)

-        spf (Segmento PostFinale)

o      adds/cuts a spf to les

o      values:

       3: exclusion of que (enclitic)

       see tabspf table, field comp_cod

-        les

-        codles

o      values: see annex 1; see table eagles, field codles

-        lem

o      LE:

       a complete form

NOTE: in case of homography between two, or more lemmas, if the only difference among them is the length of a vowel, this is recorded in LE as follows:

       one quote (') after the involved vowel: the vowel is short

       two quotes ('') after the involved vowel: the vowel is long

or

       a SF to be added to les

or

       =: the lemma is identical to the les

if more than one LE is concerned, the LE are divided by a slash

o      if no LE is recorded, the lemma is created by automatically adding a SF to the les, according to a rule that depends on codles; see annex 2

-        s_omo

o      homographic lemma

o      values:

       A: homographic lemma A

       B: homographic lemma B

-        pi

o      more les in the same clem, but none with v in clem field

o      values:

       +

-        codlem

o      manually recorded if it cannot be automatically assigned according to codles

o      see annex 3; see table eagles, field codlem for the correspondence codles/codlem

-        type

o      manual recording of Type

-        codLE

o      in case of LE, exclusion of the 7-10 position codes in output patterns

o      values: see cod_le table

-        pt

o      pluralia tantum

o      values:

       x: exclusion of patterns with code s in position 8

-        a_gra

o      graphic alteration

o      values: see tabsai table

-        gra_u

o      les possibly divided in two parts

o      values

       x

-        notes

-        pr_key

o      identification number of the les

-        ts

o      Time Stamp: last time when the line has been modified
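Two of the lessario fields lend themselves to a short sketch: the n_id format (one letter followed by four digits) and the lem field (several LE divided by a slash, with = meaning that the lemma is identical to the les). The function names are hypothetical, the case of the letter is an assumption, and the sketch does not try to tell a complete form from a SF to be added, since the field description does not say how the two are distinguished.

```python
import re

# One letter (first letter of the lemma) followed by four digits.
# Accepting both upper and lower case is an assumption.
N_ID_PATTERN = re.compile(r"[A-Za-z][0-9]{4}")


def is_valid_n_id(n_id: str) -> bool:
    """Check the n_id format described for the lessario table."""
    return N_ID_PATTERN.fullmatch(n_id) is not None


def lem_entries(lem: str, les: str):
    """Split the lem field on slashes and resolve '=' to the les itself."""
    return [les if part == "=" else part for part in lem.split("/")]


print(is_valid_n_id("p0123"), is_valid_n_id("0123p"))
print(lem_entries("=", "et"))
```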

 

 

cod_le

 

List of codes and values for LE analysis.

 

-        cod_LE

o      codLE: in the analysis of an LE, adds the codes from c04 to c10. See cod_morf table for code values

-        c04

o      codes in position 4

-        c05

o      codes in position 5

-        c06

o      codes in position 6

-        c07

o      codes in position 7

-        c08

o      codes in position 8

-        c09

o      codes in position 9

-        c10

o      codes in position 10

-        pr_key

o      identification number of the codLE

-        ts

o      Time Stamp: last time when the line has been modified
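A codLE can be expanded into its wordform pattern(s) by reading columns c04 to c10 of the matching lines. The sketch assumes that a codLE with several patterns is stored as several lines sharing the same cod_LE value; apart from s in position 8 (mentioned under the pt field of lessario), the code values in the sample rows are hypothetical.

```python
# Column names c04 ... c10, in pattern order.
WORDFORM_COLUMNS = [f"c{p:02d}" for p in range(4, 11)]


def wordform_patterns(cod_le_rows, cod_le):
    """Return the 7-code pattern(s) recorded for one codLE."""
    return [
        "".join(row[col] for col in WORDFORM_COLUMNS)
        for row in cod_le_rows
        if row["cod_LE"] == cod_le
    ]


# Hypothetical lines of the cod_le table.
rows = [
    {"cod_LE": 1, "c04": "-", "c05": "-", "c06": "n", "c07": "f", "c08": "s", "c09": "-", "c10": "-"},
    {"cod_LE": 1, "c04": "-", "c05": "-", "c06": "v", "c07": "f", "c08": "s", "c09": "-", "c10": "-"},
    {"cod_LE": 2, "c04": "-", "c05": "-", "c06": "-", "c07": "-", "c08": "-", "c09": "-", "c10": "-"},
]

print(wordform_patterns(rows, 1))
```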

 

 

cod_morf

 

Description of codes/values/attributes occurring in the 10-position output patterns.

 

-        field_pos

o      position in the pattern

o      values: 1-10

-        field_descr

o      description of the field value

-        value_descr

o      description of the attribute for each field

-        value

o      description of the code for each attribute/field

-        ts

o      Time Stamp: last time when the line has been modified
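With cod_morf loaded as rows, a 10-position pattern can be decoded into readable attributes by looking up each (position, code) pair. The sketch below uses the field names of the table; the sample rows are hypothetical, except that s in position 8 is taken to mean singular, as suggested by the pt field of lessario.

```python
def decode(pattern, cod_morf_rows):
    """Translate a pattern into {field description: attribute description}."""
    lookup = {
        (row["field_pos"], row["value"]): (row["field_descr"], row["value_descr"])
        for row in cod_morf_rows
    }
    decoded = {}
    # Positions are numbered from 1, as in the field_pos column.
    for position, code in enumerate(pattern, start=1):
        if (position, code) in lookup:
            field, attribute = lookup[(position, code)]
            decoded[field] = attribute
    return decoded


# Hypothetical lines of the cod_morf table.
cod_morf_rows = [
    {"field_pos": 1, "field_descr": "PoS", "value_descr": "noun", "value": "N"},
    {"field_pos": 8, "field_descr": "number", "value_descr": "singular", "value": "s"},
]

print(decode("N------s--", cod_morf_rows))
```

Codes with no matching line, such as the dashes, are simply skipped, so unspecified positions do not appear in the result.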

 

 

eagles

 

Conversion codles/codlem/1-3 position codes (lemma codes)

 

-        codles

o      codles list

-        codlem

o      codlem corresponding to codles recorded on the same line

-        c01

o      codes in position 1

-        c02

o      codes in position 2

-        c03

o      codes in position 3

 

 

forme_ecc

 

Hard-Coding of exceptional wordforms pattern(s).

 

-        les_id

o      link to corresponding line in lessario table (pr_key field)

-        add_lem

o      link to a second lemma through pr_key field in lessario table

-        enc

o      presence of an enclitic

-        c01

o      codes in position 1

-        c02

o      codes in position 2

-        c03

o      codes in position 3

-        c04

o      codes in position 4

-        c05

o      codes in position 5

-        c06

o      codes in position 6

-        c07

o      codes in position 7

-        c08

o      codes in position 8

-        c09

o      codes in position 9

-        c10

o      codes in position 10

-        pr_key

o      identification number of the line

-        ts

o      Time Stamp: last time when the line has been modified
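The les_id link can be followed with an ordinary join. The sketch below builds an in-memory SQLite database with reduced versions of lessario and forme_ecc; the choice of SQLite, the column types, the sample les and all code values are assumptions made for illustration and do not describe the actual LEMLAT database.

```python
import sqlite3

CODE_COLUMNS = [f"c{p:02d}" for p in range(1, 11)]  # c01 ... c10

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE lessario (pr_key INTEGER PRIMARY KEY, n_id TEXT, les TEXT, codles TEXT)"
)
conn.execute(
    "CREATE TABLE forme_ecc (pr_key INTEGER PRIMARY KEY, les_id INTEGER, "
    + ", ".join(f"{col} TEXT" for col in CODE_COLUMNS)
    + ")"
)
# Hypothetical exceptional form and its hard-coded 10-code pattern.
conn.execute("INSERT INTO lessario VALUES (1, 'x0001', 'xform', 'fe')")
conn.execute(
    "INSERT INTO forme_ecc VALUES (1, 1, 'N', '-', '-', '-', '-', '-', '-', 's', '-', '-')"
)


def exceptional_patterns(connection, wordform):
    """Return the hard-coded patterns of a non-segmented wordform."""
    pattern_expr = " || ".join(f"f.{col}" for col in CODE_COLUMNS)
    rows = connection.execute(
        f"SELECT {pattern_expr} FROM forme_ecc AS f "
        "JOIN lessario AS l ON l.pr_key = f.les_id WHERE l.les = ?",
        (wordform,),
    )
    return [row[0] for row in rows]


print(exceptional_patterns(conn, "xform"))
```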

 

 

tab_le

 

List of LE recorded along with its own codLE

 

-        lemma

o      list of LE

-        codLE

o      codLE

o      Value: see cod_le table, field cod_LE

-        les_id

o      link to