Workpackage 4: Old Norse Morphological Analyser


Year 3 Executive Summary


Timothy Tangherlini and Matthew Driscoll

UCLA and University of Copenhagen



In year three we rewrote the code for, and refined, a morphological analyzer and English-language look-up tool for Old Icelandic that is interoperable with the CHLT-Perseus and Greenstone digital library systems. Our teams at UCLA and the Arnamagnæan Institute (AMI) at the University of Copenhagen linked the morphological analyzer to diplomatic and normalized editions of manuscripts (transcribed and hand-tagged in XML-TEI for CHLT at AMI), which are in turn linked to images of the manuscript pages (created for CHLT at AMI). We integrated these into the CHLT-Perseus Digital Library System at Tufts University, where the morphological analyzer/look-up tool now works with the diplomatic texts and manuscripts as well as with the Standard Edition texts of the legendary sagas (Fornaldar sögur). We have also worked with Imperial College London to incorporate our tools and texts within the visualization programme developed for CHLT.


The details of this work break down into six thematic areas:


(i) Underlying Code: We changed the underlying code of the morpho-syntactic parser to make it 'object oriented', so that the rule set for Old Norse takes the form of a 'module' (rather than being hard coded in the parser itself).


(ii) Rules: We developed far more precise rules than the Year 2 version for the various word classes.


(iii) Phenomena: We planned a strategy to deal with certain phenomena unique to the language (umlaut, Verner's Law, syncope) and implemented it to increase parser accuracy.



(iv) Lexical Sets: We fixed most of the problems with the lexical sets, making sure that we had eliminated most  spelling errors.


(v) Integration: We integrated the parser with (a) the diplomatic marked-up texts and (b) the normalized marked-up texts, (c) integrated both with Perseus, and (d) integrated the results with the Imperial College visualization and clustering tool.


(vi) Dissemination and Exploitation of Results: We began work on a proposal for future work that builds on the results of CHLT.





Our results are revolutionary in that, for the first time, both students and scholars can search across the Fornaldar sögur with morphological analysis tools that provide paradigms for all Old Norse words in Zoëga. This opens up new ways of studying both the language and the literature of Old Norse, which are for the most part inaccessible to the uninitiated and elusive even to those who know the tradition.


The most tangible results of CHLT are the following:


1) Old Norse Morphological Analyzer (


2) Old Norse TEI-Transcription Guidelines (see below)


3) Creation of TEI hand-tagged transcriptions of Old Norse MSS


4) Creation of Images of Old Norse MSS


5) Integration of Old Norse Texts with Morphological Analyzer in Integrated Reading Environment


 D. 4.7 :  and


(this is a dynamic web-based deliverable)



6) Co-authored article by Urban, Aurelijus Vijunas and Tangherlini, "Toward an automated morphological analyzer for the study of Old Icelandic Texts", in preparation for the Journal of English and Germanic Philology.


7) Strategy for dissemination and continuation of CHLT (see NSF Proposal below)



(2) Transcription Guidelines for Old Norse on CHLT Project


A. Text


1. The text should be transcribed exactly as it is with respect to orthography and spacing. With the exception of small capitals, used to denote geminates (principally N and R, but potentially also D, G, M, S and T), variant forms of the same letter (allographs) need not be distinguished. It may, in some cases, be deemed necessary also to distinguish between:


high and round s
ordinary and round r (r-rotunda)
ordinary and insular forms of f and v
ordinary and uncial forms of d, e, m and t

Note that only ligatures with an independent phonemic value (a and e, double a etc.) are to be represented (using the entities defined by MENOTA); ligatures which are the result of graphic economy should be treated as two separate characters (high s + t, for example).

2. Expand abbreviations in accordance with the normal spelling of the scribe in question, using <expan> to indicate supplied letters:


It is not necessary in first-level transcription to indicate the means by which the abbreviation is achieved, although one may choose to do so:

se<expan abbr="&bar;">m</expan>

Abbreviation by suspension may be distinguished from abbreviation by other means (contraction, supralinear symbol etc.) by means of the type attribute:

Hald<expan type="susp">anar</expan>

Expansions of such abbreviations can then be made to display in round brackets, as in a traditional printed edition.

3. Use <supplied reason="omitted"> to indicate letters or words assumed to have been inadvertently omitted by the scribe (which in a printed edition would be placed in angle brackets):

gieck sijdan <supplied reason="omitted">j burt</supplied>

4. Use <supplied reason="illegible"> to indicate letters now unreadable but assumed originally to have been in the manuscript (which in a printed edition would be placed in square brackets):

lid<supplied reason="illegible">z</supplied>

5. Where necessary to the sense, certain emendations and alterations may be made to the text; obvious misspellings, for example, should be corrected using <corr>, with the original reading given as the value of the SIC attribute:

<corr sic="giorit">giorir</corr>

With <supplied> and <corr> the attribute resp can be used (although this will not generally be necessary) to indicate the scholar or previous editor responsible for the conjectural emendation:

<corr sic="giorit" resp="MO">giorir</corr>

With <supplied> a source attribute is also available (but for first-level encoding normally not necessary), where the reading is taken from another witness:

ath &thorn;eir <supplied reason="omitted" source="AM 152 fol., 76ra">mundu</supplied> sundr ganga

Note that source is not available on <corr>, although logically it should be.

The <supplied> element should only be used when the missing text can be reconstructed with a very high degree of certainty. When such is not the case <gap> should be used instead, with both a reason and an extent attribute. The extent should be given as the number of characters presumed missing, which can then be made to display as a series of small noughts, as is customary in a printed edition.

6. Additions and deletions made in the manuscript by the scribe or in another hand should be indicated with the <add> and <del> elements; further information may (but need not) be given as attribute values.

<add place="margin" hand="scribe">&thorn;v&iacute; ha&nscap; var komi&nscap; fra godonum ok kalla&thorn;ur &stall;on Odins</add>

s<del rend="superpunction">a</del>umra

B. Structure

1. Indicate line-, column- and page-boundaries using the empty milestone tags <lb/>, <cb/> and <pb/>, giving a number for each as the value of the n attribute.

<pb n="1v"/>

These tags should come at the beginning of the line/column/page to which they refer.

2. Large structural divisions in the text, i.e. chapters, should be tagged using <div type="chapter"> and given a number; each <div> will contain one or more <p> elements.

3. Chapter headings should be tagged using <head>, which is placed immediately after <div> and before the first <p>. The nature of the <head>, i.e. whether it is found in the manuscript itself or supplied by an editor, should be indicated in the value of the TYPE attribute.

<head type="rubric">I. Cap<expan>itulum</expan></head>

<head type="supplied">Chapter 3</head>

4. Verses in the text should be tagged using <lg> (line-group) for stanzas and <l> (line) for individual lines:

<lg>
<l>Aukum nu elldana</l>
<l>ad Adilz borg.</l>
</lg>

As they frequently are part of direct speech, verses will normally occur within a <p> (the DTD has been changed to allow for this).

C. Normalisation

1. Place each word inside an <orig> element, giving the normalised form as the value of the reg attribute:

<orig reg="l&iacute;ka&eth;i">lijkadi</orig>

2. Compound words written separately in the manuscript should be grouped together within a single set of <orig> tags:

<orig reg="st&oacute;rilla">&stall;t&oacute;r illa</orig>

In the opposite situation, where for example a preposition and its object are written as a single word, the two parts should be treated as separate words, each placed within a set of <orig> tags, but with no space between them:

<orig reg="&aacute;">a</orig><orig reg="landi">lande</orig>

3. Marks of punctuation should be outside the <orig> tags.

4. Care must be taken to ensure that tags are placed in such a way so as not to overlap. Where one or more letters have been supplied within a word, for example, the <supplied> tag will obviously come inside the <orig> tag:

<orig reg="&thorn;ri&eth;jung">&thorn;<supplied reason="omitted">r</supplied>idiung</orig>

When an entire word has been supplied, it is in theory immaterial which element contains the other, but when two or more words are supplied the <supplied> tag must stand outside the <orig> tags:

<supplied reason="omitted"><orig reg="&iacute;">j</orig> <orig reg="burt">burt</orig></supplied>

Problems can thus arise in particular when supplied (or added or deleted) text begins within one word and ends within another. In such cases two sets of tags must be used, as in the following example (where the rend attribute has been used in order to ensure proper display):

<orig reg="Frigg">f<expan>ri</expan><supplied reason="illegible" rend="noclose">gg</supplied></orig> <supplied reason="illegible" rend="noopen"><orig reg="heyrir">heyrir</orig> <orig reg="b&aelig;n">b&eogon;n</orig> <orig reg="&thorn;eira">&thorn;eirra</orig> <orig reg="ok">ok</orig> <orig reg="segir">segir</orig></supplied>

M. J. Driscoll
Last update: 23.05.2005.
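
The normalisation markup described in section C lends itself to automatic extraction of parallel diplomatic and normalised readings. The following is a minimal Python sketch of that idea, not part of the CHLT tool chain: the function name readings and the sample fragment are hypothetical, and the sample uses literal characters in place of the MENOTA entities because the entity declarations are not available here.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment built from the examples in section C; literal
# characters replace entities such as &iacute; (an assumption of this sketch).
SAMPLE = (
    '<p><orig reg="líkaði">lijkadi</orig> '
    '<orig reg="á">a</orig><orig reg="landi">lande</orig>.</p>'
)


def readings(xml_fragment):
    """Return (diplomatic, normalised) pairs for every <orig> element."""
    root = ET.fromstring(xml_fragment)
    pairs = []
    for orig in root.iter("orig"):
        # itertext() also collects letters nested in <supplied>, <expan> etc.
        diplomatic = "".join(orig.itertext())
        pairs.append((diplomatic, orig.get("reg")))
    return pairs


if __name__ == "__main__":
    for diplomatic, normalised in readings(SAMPLE):
        print(diplomatic, "->", normalised)
```

Because the letters of a word may sit inside nested elements such as <supplied>, the sketch joins all text below each <orig> rather than reading only its direct text.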

7)  Strategy for Dissemination, Exploitation and Continuation of CHLT:


2005 NSF Project Proposal based on CHLT results:


From a morphonological point of view, Old Icelandic is the most complex ancient Germanic language (Iversen 1994, Jónsson 1908, Krahe 1969, Noreen 1884, Steblin-Kamenskij 1953 and 1955). The study of lexical, syntactic and morphological change in Old Icelandic, and the study of word use in Old Icelandic texts, have been greatly enhanced by the NSF-EC funded Cultural Heritage Language Technologies Project. CHLT has made possible a quantum leap in the study of Old Norse by creating for the first time sophisticated tools and modes of analysis for scholars in this field. The creation of CHLT digital editions of Old Icelandic texts and images tagged in morpho-syntactic detail has made it possible to achieve disambiguation scores for this difficult language.


To develop the results of CHLT further, we propose to:


Augment our morphological analyzer for Old Icelandic to include grammatical disambiguation (morphological and syntactic) in context; develop automated orthographic normalization routines, coupled to a timelining function related to orthographic change; refine the output of the morphological analyzer; greatly expand the underlying lexical dataset for which the analyzer currently works; expand the English language lookup tool to include all of the entries in Cleasby-Vigfusson; integrate these tools with a widening corpus of both Standard Edition and diplomatically transcribed Old Icelandic texts; integrate these components into several digital library environments; and adapt the morphological analyzer code on a test basis to Old English.


Our next goal is to develop a series of automated grammatical disambiguation routines for Old Icelandic. Because of the phonological and morphological complexity of Old Icelandic, a great number of ambiguous forms exist. Either identical forms can be derived from multiple lemmata (e.g. fara: gen. pl. of the neut. noun far 'a means of passage'; or inf. / 3rd pl. ind. prs. / 1st sg. sub. prs. of the verb fara 'to travel'), or identical forms exist within the paradigm of a single lemma; for example, oblique forms of nouns are often ambiguous with regard to case (e.g. bleytu: acc./gen./dat. sg. of bleyta 'mud'). In some instances, both are true (e.g. unni: a) 3. p. sg. preterite indicative active of the verb unna 'to love'; or b) 3. p. sg./pl. present subjunctive active of the verb unna 'to love'; or c) acc./dat. sg. of the feminine noun unnr 'wave'). Our system will incorporate rules that disambiguate between these forms in context, offering greater precision in identifying part of speech and case/person for a given form. We are not proposing to undertake automated semantic disambiguation at this stage (differentiating between possible meanings of a single lemma, e.g. far can mean (1) a means of passage, ship; (2) passage; (3) trace, print; (4) life, conduct, behavior; (5) state, condition).
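
The starting point for such routines is a lookup from a surface form to every analysis it could represent. The following Python sketch illustrates that lookup using only the three examples given above; it is not the project's Perl code, and the row layout (form, lemma, part of speech, tag) and the function names are assumptions made for illustration.

```python
from collections import defaultdict

# Rows are (surface form, lemma, part of speech, grammatical tag), taken from
# the examples in the text. The row layout is an assumption of this sketch.
ANALYSES = [
    ("fara", "far", "noun", "gen. pl."),
    ("fara", "fara", "verb", "inf. / 3rd pl. ind. prs. / 1st sg. sub. prs."),
    ("bleytu", "bleyta", "noun", "acc./gen./dat. sg."),
    ("unni", "unna", "verb", "3. p. sg. pret. ind. act."),
    ("unni", "unna", "verb", "3. p. sg./pl. pres. subj. act."),
    ("unni", "unnr", "noun", "acc./dat. sg."),
]


def build_index(rows):
    """Map each surface form to the list of analyses that produce it."""
    index = defaultdict(list)
    for form, lemma, pos, tag in rows:
        index[form].append((lemma, pos, tag))
    return index


def candidates_by_lemma(index, form):
    """Group the candidate tags for a form under their lemma and part of speech."""
    grouped = defaultdict(list)
    for lemma, pos, tag in index.get(form, []):
        grouped[(lemma, pos)].append(tag)
    return dict(grouped)


if __name__ == "__main__":
    index = build_index(ANALYSES)
    for form in ("fara", "bleytu", "unni"):
        print(form, candidates_by_lemma(index, form))
```

A disambiguation rule would then take the candidate list for a form together with its neighbours in the sentence and discard the analyses that the context rules out; that step is not attempted here.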


In cases where grammatical disambiguation cannot occur with 100% certainty, the routines we develop will provide a measure of statistical likelihood between the possible choices (described in greater detail below in section 2.c). The disambiguation routines will allow for greater precision of the morphological analyzer's output when used in conjunction with the growing digital corpus of Old Icelandic prose texts (predominantly sagas). Searches for words and phrases in a text corpus in which ambiguous forms have been further tagged with information concerning the disambiguation of those forms will yield far better results than similar searches on unmarked texts or marked texts in which ambiguous forms are not properly tagged, as the noise generated by ambiguous forms will be greatly reduced or eliminated. Consequently, evaluations of the results will be more meaningful to researchers. Such queries will allow for quite specific searches concerning word use, overall vocabulary, linguistic change over time, regional usage of words (to the extent that this can be determined) as well as specific aspects of syntax and grammar. Our disambiguation and tagging of Old Icelandic texts is an important first step toward building a linguistic tree bank and eventually a parsed corpus of Old Icelandic.


To expand the range of the morphological analyzer and the disambiguation routines, we propose to increase greatly the lexical set to which the current analyzer is linked. Currently, the analyzer has a limited lexical database derived from Zoëga's Old Icelandic Dictionary (1910). This dictionary is a subset of the standard English-language dictionary for Old Icelandic, Cleasby-Vigfusson's An Icelandic-English Dictionary (1874). We will incorporate all of Cleasby-Vigfusson in our new dictionary tool, and further expand the lexical set with the Ordbog over det norrøne prosasprog (ONP) (Degnbol et al. 1995-), a comprehensive list of all lemmata in the Old Icelandic prose language (approx. 68,000 lemmata). While Cleasby-Vigfusson includes adequate information concerning irregular forms, ONP does not. Accordingly, our database of exceptions and irregular forms will initially cover only the 40,000 or so lemmata found in Cleasby-Vigfusson; we will develop an easy user interface for updating the irregular forms of words found in ONP but not occurring in Cleasby-Vigfusson so that, by the end of the grant period, the underlying database of exceptions and irregular forms should be nearly complete. In turn, this more complete table of exceptions will increase the accuracy of the morphological analyzer.


We will also develop normalizing routines for Medieval Icelandic orthography. The architecture of the normalizer already exists in our Normalizer module, used by the morphological analyzer to standardize lexical database input (see section 2.b below). The advanced orthographic normalizer will include a time-lining feature that will take into account orthographical changes in Icelandic up to the 15th century. Changes in Icelandic orthography were often related to changes in phonology. Unlike phonological changes, orthographical changes were reflected rather inconsistently, particularly in later manuscripts. Incorporating a time-lining feature will help map orthographic change and will also allow for standardized searching on forms across manuscripts written in different time periods. We consider this time-lining of orthographic change to be an important feature since the morphological analyzer/disambiguator is intended to work with normalized text as well as diplomatic transcriptions of manuscripts from various periods and various scriptoria. The normalizer will also allow us to back-normalize Old Icelandic texts written with modern Icelandic orthography, greatly expanding the number of texts available to the system.


We will continue to use the Fornaldar sögur (Legendary Sagas) as our test platform. The corpus includes Standard Edition (SE) texts as well as diplomatic transcriptions of variant manuscripts on which those SE texts are based. We propose to apply our disambiguation routines to the standard edition (normalized) texts, as well as to the automatically marked-up diplomatic editions of manuscript variants of Fornaldar sögur texts that we have developed during the past three years. The morphological analyzer, the disambiguation routines, and the XML-tagged SE and diplomatic edition texts will be incorporated into the Perseus digital library at Tufts University and into a Greenstone Digital library environment at UCLA. Automatic XML markup will proceed according to the conventions for Early Scandinavian mark-up described in the MENOTA handbook (Menota 2003).


Because of the multiple structural similarities between Old English and Old Icelandic, we plan to adapt the architecture of our morphological analyzer/disambiguator on a test basis for Old English. We will likely base the look-up tool on the available headwords of the Dictionary of Old English from the University of Toronto. The goal of this adaptation is to show how the architecture of our morphological analyzer can be applied to the other ancient Germanic languages. From the morph(on)ological point of view, the other ancient Germanic languages are less complex than Old Icelandic, and analyzers and disambiguation routines for them should be less difficult to implement. We expect our work to have import for the development of morphological analyzers for other Indo-European languages.


Objectives of the Project


Develop disambiguation routines for addressing ambiguous forms in Old Icelandic texts (Sections 2.c and 2.d)

Develop a method for automatically scoring results of disambiguation; and integrate these statistical scores into the XML markup (Sections 2.c and 2.d)

Develop an orthographical normalizer and a time-lining feature to account for orthographical change in Old Icelandic; expand the number of digital texts available in standard Old Icelandic orthography (Section 4)

Refine the output of the existing Old Icelandic morphological analyzer primarily by increasing the size and accuracy of the table of exceptions (Sections 2.a, 2.b and 3)

Expand the underlying lexical set to include all of the lemmata in Cleasby-Vigfusson; supplement this set with lemmata from Ordbog over det norrøne prosasprog using a simple webform accessible to project developers and expert users (Section 3)

Expand the English language lookup tool (Section 3)

Port the code of the morphological analyzer to Old English on a test basis (Section 6)

Integrate the second generation morphological analyzer/disambiguator with digital library systems (Perseus and Greenstone), visualization tools (Greenstone, Spire, Cascade, Navigational View Builder etc.) and other text analysis tools (eg. Wordstat, Xaira, Juxta); explore export of the system as a SCORM learning object (Section 5)


Morphology, Disambiguation and Old Icelandic


2.a Morphological Complexity of Old Icelandic and automated morphological analysis


Compared to the other ancient Germanic languages, the morphonological system of Old Icelandic is relatively complex and in many ways irregular. This complexity stems from the instability of the phonological system and multiple irregular developments, as well as from an ambiguity of endings and active processes of analogy. Many of these phenomena took place before Icelandic was established as an individual language, while others affected the language in the course of its internal development.


The nominal system of Old Icelandic consists of sixteen inflectional classes, most of which can be further subdivided into subclasses. The Old Icelandic adjectives can be grouped into two morphologically and semantically distinct classes of strong (indefinite) and weak (definite) adjectives. The verb system consists of three large classes of the so-called strong, weak and preterito-present verbs. Each of these classes poses specific challenges to the development of an automated morphological analyzer.


In the Germanic proto-language, nouns of different classes were characterized primarily by different suffixes and at times by different endings. In the course of development of the Germanic languages, the various suffixes frequently merged with endings by means of various phonological processes, and eventually disappeared as independent morphemes. The new endings, which in early Germanic were still quite different from each other, were affected by special Germanic phonological rules (Verner's law, reduction of unstressed vowels) and in several instances became homophonous, cf. feminine ō-stem genitive ending *-ōR (< *-ās) and u-stem genitive ending *-ōR (< *-ous), both of which evolved into -ar in Old Icelandic. Also, many endings could be added to nouns of more than one gender. In numerous cases, the ambiguity of endings caused confusion of inflections, transfer of nouns from one class to another, or paradigmatic split (Gutenbrunner 1951, Krahe 1969, Noreen 1884).


The endings of Old Icelandic verbs are relatively straightforward. Much more problematic are the morphologically and phonologically conditioned vowel alternations in the root, which can significantly affect the shape of the root in different forms of the same word. Analogical restorations and transformations, which always work against regular phonological development, have contributed to the creation of numerous by-forms and parallel paradigms. 


As with the nouns, there have been numerous transfers of verbs between classes and paradigmatic splits. Due to their phonological shape, many archaic strong verbs developed in irregular ways, eventually acquiring abnormal paradigms. Already at an early stage, native speakers created alternative regular paradigms for such verbs, and in many instances irregular verbs possessed more than one paradigm (in some cases as many as six, cf. the verb gørva 'make', which due to its aberrant shape acquired five by-forms in the course of its development, cf. gera, gerva, gøra, gjǫra, gjǫrva, each having its own paradigm). Our morphological analyzer deals well with this type of complexity, relying both on calculation of regularly produced forms and on an underlying table of exceptions for irregular forms that cannot be calculated. For example, it currently returns the following complex paradigm for gøra:


Infinitive: gøra
Present Participle: gørandi
Past Participle: gørr

Infinitive: gørask
Present Participle: gørandisk
Past Participle: gørzk


Despite the ability of the morphological analyzer to deal with complex paradigms, the current lexical database does not account for all five secondary forms, but rather uses Zoëga's pointers of gera to gøra, gerva to gørva, and gjǫr- to gǫr- or gør-; Zoëga's standard form gøra includes the secondary form gørva, which is actually the original form and should be the default form. This lack of clarity regarding secondary forms in our underlying lexical database will be addressed in the extension of the lexical database and the expansion of the table of exceptions.


A large part of the complexity of Old Icelandic morphonology can be attributed to the phonological processes of umlaut and breaking, which affect the shape of the stem in various ways, cf. sag-a 'saga' (nom. sg.) vs. sǫg-u (obl. sg.; u-umlaut changes /a/ to /ǫ/), or berg 'save' (1. p. sg. pres.) vs. bjargið 'save' (2. p. pl. pres.; a-breaking changes /e/ to /ja/), etc. In many cases, more than one umlaut (or umlaut and breaking) obtains, cf. søkkva 'sink' (transitive v.) (< *sankwijan; the root vowel /a/ undergoes u/w-umlaut and then the resulting */ǫ/ undergoes i-umlaut to /ø/).


In those word-forms where the conditions for an umlaut did not exist, it did not occur. This lack of umlaut resulted in different forms of the same word having different shapes (allomorphy), cf. sǫk (nom. sg.) vs. sak-ar (gen. sg.), or sag-a (nom. sg.) vs. sǫg-u (obl. sg.). In those cases where more than one umlaut (or umlaut and breaking) operated, the number of allomorphs rose accordingly, cf. fjǫrðr 'fiord' (nom. sg.; < *ferþ-uR; u-breaking: e > jǫ /_Cu) vs. firði (dat. sg.; < *ferþ-ī; i-umlaut: e > i /_Ci) vs. fjarðar (gen. sg.; < *ferþ-aR; a-breaking: e > ja /_Ca). As a result, paradigms can become quite complex, such as the paradigm for fjǫrðr:


        Singular                    Plural
Nom.    fjǫrðr   (< *ferþ-ur)       firðir   (< *ferþ-īr)
Acc.    fjǫrð    (< *ferþ-un)       fjǫrðu   (< *ferþ-unn)
Gen.    fjarðar  (< *ferþ-ar)       fjarða   (< *ferþ-an)
Dat.    firði    (< *ferþ-ī)        fjǫrðum  (< *ferþ-umm)
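
The three conditioning rules quoted above (e > jǫ before Cu, e > i before Ci, e > ja before Ca) can be written as a small lookup keyed on the first vowel of the proto-ending. The Python sketch below is illustrative only and is not the analyzer's Perl rule set; the names ROOT_CHANGE, PROTO_ENDINGS and root_alternant are hypothetical, and the sketch derives only the root alternant, leaving the later change of þ to ð and the reduction of the endings unmodelled.

```python
# First vowel of the proto-ending -> reflex of root /e/, as stated in the text:
#   u-breaking: e > jǫ / _Cu    i-umlaut: e > i / _Ci    a-breaking: e > ja / _Ca
ROOT_CHANGE = {"u": "jǫ", "i": "i", "ī": "i", "a": "ja"}

# Proto-endings as given in the paradigm above.
PROTO_ENDINGS = {
    "nom. sg.": "ur", "acc. sg.": "un", "gen. sg.": "ar", "dat. sg.": "ī",
    "nom. pl.": "īr", "acc. pl.": "unn", "gen. pl.": "an", "dat. pl.": "umm",
}


def root_alternant(root, ending):
    """Return the root with /e/ changed according to the ending's first vowel."""
    reflex = ROOT_CHANGE.get(ending[:1])
    if reflex is None:
        return root  # no conditioning vowel, so the root stays unchanged
    return root.replace("e", reflex, 1)


if __name__ == "__main__":
    for slot, ending in PROTO_ENDINGS.items():
        print(slot, "*ferþ-" + ending, "->", root_alternant("ferþ", ending))
```

Run over the eight slots, the sketch yields three root shapes (fjǫrþ-, firþ-, fjarþ-) distributed as in the paradigm above, which is the allomorphy the paragraph describes.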


Again, our morphological analyzer deals quite well with this type of complexity, and accurately returns:


In addition to umlaut and breaking, Old Icelandic exhibits other complex phonological features. Another common phenomenon is syncope of unstressed vowels. However, the rules for its operation are not easy to define. Syncope tends to occur in words which in the protolanguage were trisyllabic (or longer), but it is by no means reflected in a regular way, cf. jǫtn-ar 'giants' (nom. pl.; 2 syllables) < *jǫt-un-ar (3 syllables), but skrif-ar-ar 'scribes' (3 syllables). Syncope is quite irregular among adjectives, operating in some words and not in others, even though they may belong to the same derivational type, cf. mál-i-gr 'talkative', acc. sg. masc. mál-gan, but kunn-i-gr 'expert', acc. sg. masc. kunn-i-gan. For máligr, for example, our morphological analyzer accurately returns:


In other cases, our morphological analyzer returns results that are incorrect. Syncope is one of the ongoing challenges as we refine our automated morphological analyzer.


Along with syncope, phenomena related to the phonological changes to consonants and consonant clusters as a result of the processes of assimilation, dissimilation, degemination, devoicing in word-final position and Verner's law pose a challenge to our automated morphological analyzer. All of these irregularities, while fairly well addressed in the current morphological analyzer, require a degree of attention that we have as yet been unable to apply consistently across all word classes. However, we have been able to describe these phenomena well, and will implement these descriptions as refined rule-sets in the Target Language module described below in conjunction with our planned expansion of the lexical dataset, the table of irregular forms and the English language lookup tool.


2.b The Old Icelandic Morphological Analyzer: Architecture and Implementation


The morphological analyzer produces word form tables based on lemmata from Zoëga's lexicon. In addition, it comments on the computations by which it arrives at the final output. For example, given a head word barn the analyzer performs a lexicon lookup to retrieve the following information from its digital copy of Zoëga's lexicon:


barn | barn | E | n | (1) bairn, child; vera með barni, to be with child; ganga með barni, to go with child; barns hafandi or hafandi at barni, with child, pregnant; frá blautu barni, from one's tender years; (2) = mannsbarn; hvert b., every man, every living soul


Each lexicon entry consists of five fields: the headword itself, its original form in the lexicon, declension information (which in this case is empty as signaled by the symbol E), its part-of-speech, and finally its translation and usages.
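
As an illustration of the five-field layout just described, the following Python sketch splits one pipe-delimited entry into named fields. It is a hypothetical reading aid rather than the analyzer's own Perl code; the class name LexiconEntry, the field names and the function parse_entry are assumptions, as is the choice to turn the empty marker E into an empty string.

```python
from dataclasses import dataclass


@dataclass
class LexiconEntry:
    headword: str       # normalized headword
    original_form: str  # form as printed in the lexicon
    declension: str     # "" when the source field holds the empty marker "E"
    pos: str            # part of speech
    gloss: str          # translation and usages


def parse_entry(line):
    """Split a 'headword | original | declension | pos | gloss' line."""
    # maxsplit=4 keeps any later "|" characters inside the gloss field.
    fields = [field.strip() for field in line.split("|", 4)]
    if len(fields) != 5:
        raise ValueError("expected five pipe-separated fields: %r" % line)
    headword, original, declension, pos, gloss = fields
    if declension == "E":
        declension = ""
    return LexiconEntry(headword, original, declension, pos, gloss)


if __name__ == "__main__":
    print(parse_entry("barn | barn | E | n | (1) bairn, child; (2) = mannsbarn"))
```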


Given this information and an internal representation of the phonology and morphology of the target language Old Icelandic, the morphological analyzer determines and outputs all potential paradigms:


barn, noun, gender: n, a-stem

(1) bairn, child; vera með barni, to be with child; ganga með barni, to go with child; barns hafandi or hafandi at barni, with child, pregnant; frá blautu barni, from one's tender years; (2) = mannsbarn; hvert b., every man, every living soul


The user can choose to output the analyzer's internal application of its linguistic rules, allowing students of Old Icelandic to understand the derivation of the forms found in the paradigm. For each output form, it lists the phonological and/or morphological rule underlying its change:


Lexeme: barn
Gender (if any): n
Declension info: nom_sg E
Stem (if any): a

The root is barn.

I found a stem: a.

Root consonants: b - r n
Root vowels: - a - -
Root vowels only: a

Sound changes for element Nom Sg:

Sound changes for element Acc Sg:

Sound changes for element Gen Sg:

Sound changes for element Dat Sg:

Sound changes for element Nom Pl:
     u-mutation to neut a-stem, nom & acc pl ...
Sound changes for element Acc Pl:
     u-mutation to neut a-stem, nom & acc pl ...
Sound changes for element Gen Pl:

Sound changes for element Dat Pl:
     Regular u-mutation ...

Figure 1: Parsing detail from the Old Icelandic Morphological Analyzer
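
The trace in Figure 1 can be imitated in a few lines. The Python sketch below builds the eight forms of a neuter a-stem such as barn and records a note wherever u-mutation applies, matching the three slots flagged in the trace. It is a simplified illustration under stated assumptions and not the analyzer's implementation: the names are hypothetical, the endings are the standard neuter a-stem endings, and the u_mutate helper only handles a monosyllabic root whose vowel is a.

```python
# Standard endings of a neuter a-stem, keyed by the slot names used in Figure 1.
NEUT_A_STEM_ENDINGS = {
    "Nom Sg": "", "Acc Sg": "", "Gen Sg": "s", "Dat Sg": "i",
    "Nom Pl": "", "Acc Pl": "", "Gen Pl": "a", "Dat Pl": "um",
}

# Slots in which the trace reports u-mutation, with the note it prints.
U_MUTATION_NOTES = {
    "Nom Pl": "u-mutation to neut a-stem, nom & acc pl",
    "Acc Pl": "u-mutation to neut a-stem, nom & acc pl",
    "Dat Pl": "Regular u-mutation",
}


def u_mutate(root):
    """Change root a to ǫ (assumes a monosyllabic root, as in barn)."""
    return root.replace("a", "ǫ", 1)


def inflect(root):
    """Return a list of (slot, form, note) for a neuter a-stem root."""
    rows = []
    for slot, ending in NEUT_A_STEM_ENDINGS.items():
        note = U_MUTATION_NOTES.get(slot, "")
        stem = u_mutate(root) if note else root
        rows.append((slot, stem + ending, note))
    return rows


if __name__ == "__main__":
    for slot, form, note in inflect("barn"):
        print("%-7s %-8s %s" % (slot, form, note))
```

Keeping the note next to each generated form is what lets a tool of this kind explain its output to a student, as the paragraph above describes.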


The design and implementation of our morphological analyzer are guided by two main principles. An object-oriented layout allows for its adaptation to languages other than Old Icelandic. In addition, its separation of linguistic rules, natural language resources, and the code itself enables the user to add new language resources. Both design principles are of major importance with regard to the scalability of the analyzer. The analyzer essentially works as a two-level morphological analyzer as described in part by Koskenniemi (1983; 1986) and Karttunen (1983) and later refined by others (Karttunen, L., Koskenniemi, K., and Kaplan, R. M. 1987; Antworth, E. L. 1990; Pulman, S. 1991; see also Karttunen, L. and Beesley, K. R. 2001). Our analyzer accepts as input either lemmata from the lexical database, in which case it outputs the paradigm for that headword, or forms from a text, in which case it outputs all possible lemmata and their paradigms (with the input form clearly marked).


The analyzer code is written in Perl, a programming language particularly suited to the manipulation of Unicode and plain text strings. In addition, it allows for the creation of classes, i.e. an object-oriented architecture. Some attractive features of object-oriented programming are the hierarchical structuring of classes, the control over variable declarations and user permissions, and a high degree of convergence between the application design and its problem space. Figure 2 illustrates the general architecture of the analyzer:


Figure 2: General architecture of the morphological analyzer.


Currently, the Lexicon module consists of an electronic copy of Zoëga's Old Icelandic lexicon, excerpts from Old Icelandic sagas and a table of exceptions that overrides the output of forms calculated by the morphological analyzer where appropriate. To ensure scalability, the analyzer has been designed to accept input from various language resources. This design feature makes the incorporation of other lexical databases quite straightforward, and will allow us to implement the Cleasby-Vigfusson additions, as well as the ONP additions, in an efficient manner.


In operation, the morphological analyzer expects a normalized form of lexical entries as its input. This is accomplished by the Normalizer module. Compare for example the following entry in Zoëga:


barna-börn, n. pl. grandchildren;


with its normalized version which is accessed by the analyzer:


barnabörn | barna-börn | E | n pl | grandchildren

The normalization occurs automatically. The general features of the Normalizer module will be expanded as the underlying structure of the Orthographic Normalizer module (see Section 4 below). This latter module will be used in conjunction with input texts rather than lexical databases.


The Target Language module contains information regarding the target language such as phonological and morphological rules. For example, the morpho-phonetic rule for the excision of consonants in Old Icelandic is represented as Perl pseudo-code:


RULE:             excision_consonant

CONDITION: rootc(-2) ne - && rootc(-1) ne - && tmp(0) eq rootc(-1)

ACTION:        shift tmp


In this rule, the morphological analyzer deletes a given consonant if certain conditions regarding the consonantal structure of the lexeme root are met. The rule set in the Target Language module contains the majority of phonological and morphological rules for Old Icelandic. Accordingly, only a few linguistic rules are hard coded into the analyzer. Our goal is to achieve complete separation between the Target Language module and the morphological analyzer itself. By implementing this separation, the Morphological Analyzer will be able to interact with Target Language modules and Lexicon Modules of other languages, such as Old English. In addition to a linguistic rule set, the Target Language module consists of several databases for language specific data such as exceptions, umlaut information, and word ending paradigms.
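The separation between rule set and analyzer described above can be illustrated with a short sketch. It is written in Python purely for illustration (the analyzer itself is written in Perl), and it rests on assumptions that the pseudo-code does not spell out: that rootc(i) returns the root character at position i when it is a consonant and '-' otherwise, that tmp holds the characters of the ending about to be attached, and that "shift tmp" removes the first of them. The names Rule, rootc, attach and OLD_ICELANDIC_RULES are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List

# Vowel inventory used to tell consonants apart (an assumption for this sketch).
VOWELS = set("aáeéiíoóuúyýæœøöǫ")


def rootc(root: str, i: int) -> str:
    """Return the root character at index i if it is a consonant, else '-'."""
    try:
        ch = root[i]
    except IndexError:
        return "-"
    return "-" if ch.lower() in VOWELS else ch


@dataclass
class Rule:
    """A rule is plain data: a name, a condition and an action."""
    name: str
    condition: Callable[[str, List[str]], bool]
    action: Callable[[List[str]], object]


# Assumed reading of the pseudo-code rule 'excision_consonant'.
excision_consonant = Rule(
    name="excision_consonant",
    condition=lambda root, tmp: (
        len(tmp) > 0
        and rootc(root, -2) != "-"
        and rootc(root, -1) != "-"
        and tmp[0] == rootc(root, -1)
    ),
    action=lambda tmp: tmp.pop(0),  # 'shift tmp': drop the first character
)

# The rule set lives outside the engine, as in a Target Language module.
OLD_ICELANDIC_RULES = [excision_consonant]


def attach(root: str, ending: str, rules: List[Rule]) -> str:
    """Generic engine: apply every matching rule, then join root and ending."""
    tmp = list(ending)
    for rule in rules:
        if rule.condition(root, tmp):
            rule.action(tmp)
    return root + "".join(tmp)


print(attach("þurr", "r", OLD_ICELANDIC_RULES))  # the ending -r is dropped
print(attach("all", "r", OLD_ICELANDIC_RULES))   # the ending -r is kept
```

Because attach knows nothing about Old Icelandic, passing it a different rule list is all that is needed to change its behaviour, which is the property the architecture aims for.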

The third module in the architecture is the morphological analyzer itself. Upon being called, it determines the root structure of a word from the Lexicon module based on the rules and definitions in the Target Language module entry for Old Icelandic. Once it determines its part-of-speech, the analyzer creates a paradigm, performs the appropriate morpho-phonetic changes, and finally outputs the paradigm.


2.c Grammatical Disambiguation


Our proposed system for the grammatical disambiguation (both morphological and syntactic) of Old Icelandic will rely directly on accurate output from the morphological analyzer coupled to a significant library of digital versions of Old Icelandic texts. While there are a variety of disambiguation strategies, this type of supervised disambiguation yields better results than unsupervised disambiguation (Manning and Schütze 2000).


Ambiguous forms arise in several ways in Old Icelandic. The ambiguity of the grammatical endings is generally a result of convergent phonological development. In turn, ambiguity of the endings is one of the causes of morphological analogy and paradigmatic reformation. Analogy and paradigmatic levelling can also be caused directly by phonological processes. In such cases, aberrant phonological development creates allomorphy within a single paradigm. The development of an allomorphic paradigm can follow several different courses. Sometimes, the more prominent allomorph may push out the less prominent one, cf. the present singular active paradigm of the verb eta 'eat', in which the more prominent allomorph et- pushed out the less prominent *jt-, expected in the 1. p. sg. present. Conversely, allomorphy may be preserved, cf. the paradigm of fjörðr above (section 2.a). Or, finally, one may encounter paradigmatic split as in the verb grva (see section 2.a; on analogy see Sturtevant 1957).


The disambiguation routines we plan to develop will allow for varying levels of end-user expertise and rely on scoring the results for each ambiguous form. Basic users will likely want to accept the highest-scoring suggestions of the disambiguation routines, while users with a strong background in Old Icelandic may want the ability to override the suggestions, or to consider all of the scored output, of the disambiguation routines. Searches on the corpus will allow users to toggle the disambiguation functions on and off; results of these searches can then be passed to various statistical tools (estimates of proportion for the occurrence of forms in a corpus, calculation of z-scores for such forms, and other standard measures of word-use, co-occurrence and vocabulary incorporated into textual analysis systems such as Wordstat or Xaira) and visualization tools (such as those developed at Imperial College and incorporated in the most recent release of Greenstone). Disambiguation will also contribute significantly to meaningful clustering and key-term extraction routines. Finally, this automated, supervised disambiguation is an important component of developing a linguistic tree bank for Old Icelandic and subsequently a parsed corpus.


2.d Design and Implementation of Disambiguation Routines


In Germanic and most other natural languages, word order follows patterns (Duda et al. 2000). To varying degrees, these patterns may be enforced by the grammar of a language. On a sub-sentence level, words can often be combined to form phrases. For example, the English phrase 'the old man' is an instantiation of the abstract pattern Determiner Adjective Noun. Old Icelandic, too, contains patterns of word order. This fact lies at the heart of disambiguation based on phrase structure dependencies. For example, in the following excerpt from The Saga of Grettir the Strong, notice the context of menn 'men, envoys':


En er þeir fréttu þat, Þórir haklangr ok Kjötvi konungr, þá sendu þeir menn til móts við þá ok báðu þá liðs ok hétu þeim sœmðum.

[...] and when Thorir Long-chin and Kjötvi the King heard of their landing they sent envoys to ask for their aid, promising to treat them with honor.


The form menn can be found in the paradigm for maðr:


maðr, noun, gender: m, r-stem

        Singular   Plural
Nom.    maðr       menn
Acc.    mann       menn
Dat.    manni      mönnum
Gen.    manns      manna

According to the paradigm, menn could be either nominative or accusative plural. To resolve this ambiguity and determine the correct form, we can analyze the context window in which menn occurs:


þá sendu þeir menn til móts við


The form sendu is uniquely identified by the morphological analyzer as an active indicative 3rd person plural verb, '(they) sent'. In addition, þeir is uniquely identified as masculine nominative plural 'they'. Given that a pattern such as Verb Subject Object occurs with high frequency in Old Icelandic texts, the disambiguation tool would correctly determine that the above instance of menn is accusative plural.

The set of word order patterns is not currently available. To create it, we will analyze each word of our text corpus using the morphological analyzer. For each cluster of uniquely identified forms, their pattern of grammatical dependency will be added to the pool of possible phrases. Thus, given a phrase like 'sendu þeir menn', the first two words will be uniquely identified as Verb (active past, 3rd plural) Noun (nominative) and added to the pool of permissible phrases. At the end of this process, we will have an inventory of permissible phrase structures together with their frequency of occurrence in the corpus.
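The construction of such an inventory, and its use for scoring an ambiguous form, might be sketched as follows. This is a simplified Python illustration, not the project's Perl implementation: it counts fixed-length windows of three forms rather than clusters of arbitrary length, the tag labels are invented, and the analyses and the two training phrases are hand-made toy data rather than analyzer output. The names build_patterns and score_candidates are hypothetical.

```python
from collections import Counter
from typing import Dict, List

# Hand-made, simplified analyses for illustration only (not analyzer output).
ANALYSES: Dict[str, List[str]] = {
    "sendu": ["V.past.3pl"],
    "báðu": ["V.past.3pl"],
    "sáu": ["V.past.3pl"],
    "þeir": ["Pron.nom.pl"],
    "sonu": ["N.acc.pl"],
    "menn": ["N.nom.pl", "N.acc.pl"],  # the ambiguous form
}


def build_patterns(sentences: List[List[str]],
                   analyses: Dict[str, List[str]],
                   n: int = 3) -> Counter:
    """Count tag patterns of length n in which every form is unambiguous."""
    counts: Counter = Counter()
    for sent in sentences:
        tags = [analyses.get(word, []) for word in sent]
        for i in range(len(sent) - n + 1):
            window = tags[i:i + n]
            if all(len(t) == 1 for t in window):
                counts[tuple(t[0] for t in window)] += 1
    return counts


def score_candidates(window: List[str], target: int,
                     analyses: Dict[str, List[str]],
                     patterns: Counter) -> Dict[str, int]:
    """Score each reading of window[target] by the frequency of the
    pattern it would produce; the other forms must be unambiguous."""
    scores = {}
    for candidate in analyses[window[target]]:
        pattern = tuple(
            candidate if i == target else analyses[word][0]
            for i, word in enumerate(window)
        )
        scores[candidate] = patterns[pattern]
    return scores


# Toy corpus: two invented unambiguous phrases plus the phrase from the text.
CORPUS = [["sáu", "þeir", "sonu"], ["báðu", "þeir", "sonu"],
          ["sendu", "þeir", "menn"]]
PATTERNS = build_patterns(CORPUS, ANALYSES)
SCORES = score_candidates(["sendu", "þeir", "menn"], 2, ANALYSES, PATTERNS)
print(SCORES)
```

In this toy setting the accusative reading of menn receives the higher score because only the Verb, nominative, accusative pattern is attested among the unambiguous windows.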

Our disambiguation strategies depend on local clues to correctly disambiguate a form. If no clue is provided, these algorithms fail. A straightforward method to improve their success rate is to expand their application to a global level, i.e. corpus-wide analysis. Here, the idea is to include similar or identical phrases that occur elsewhere in the corpus in the decision-making process.

In Saussurean linguistics, words form a paradigmatic relationship if they occur in the same linguistic environment. For example:


directing     {my, the, a, ...}     call


In this case, the words my, the, a, ... form a paradigmatic relationship. Conversely, a syntagmatic cluster of words shares the property of occurring with the same form, as in


            fiscal {policy, institution, responsibility, year, ...}


During a corpus-based paradigmatic analysis, the algorithm finds all occurrences of the context of a given form. Thus, given the text excerpt


Þá mælti Guðrún til sinnar vinkonu


from Völsunga saga with Guðrún being the current form to disambiguate, a search for phrases with identical context 'mælti ___ til' yields the following results:


Þá mælti Guðrún til Gunnars

Ok er þau vöknuðu, mælti hún til Högna

Þá mælti Bikki til Randvés


The search results provide the disambiguation algorithm with three more opportunities (Guðrún, hún, Bikki) to apply its local context analysis to determine the correct grammatical form.


In similar fashion to the paradigmatic search, the syntagmatic search looks for all occurrences of the form in question. Using the same form Guðrún, a search of the saga text yields multiple results:


Guðrún hét dóttir hans.

Eitt sinn segir Guðrún meyjum sínum at hún má eigi glöð vera.

Guðrún svarar:

"Þar mun vera Guðrún Gjúkadóttir," segir hún.



For each of these search results, the local dependency algorithms can be applied.
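The two corpus-wide searches can be sketched in a few lines. The following Python fragment is only an illustration under simplifying assumptions: sentences are tokenized on word characters, matching is on exact surface forms, and the saga excerpts are typed in with restored diacritics. The function names tokenize, paradigmatic_search and syntagmatic_search are hypothetical.

```python
import re
from typing import List, Optional, Tuple


def tokenize(sentence: str) -> List[str]:
    """Split a sentence into word tokens, discarding punctuation."""
    return re.findall(r"\w+", sentence)


def paradigmatic_search(corpus: List[List[str]],
                        left: str, right: str) -> List[str]:
    """Return every form that fills the slot in 'left ___ right'."""
    fillers = []
    for sent in corpus:
        for i in range(1, len(sent) - 1):
            if sent[i - 1] == left and sent[i + 1] == right:
                fillers.append(sent[i])
    return fillers


def syntagmatic_search(corpus: List[List[str]], form: str
                       ) -> List[Tuple[Optional[str], Optional[str]]]:
    """Return the (left, right) neighbours of every occurrence of form;
    None marks an empty position at a sentence boundary."""
    contexts = []
    for sent in corpus:
        for i, word in enumerate(sent):
            if word == form:
                left = sent[i - 1] if i > 0 else None
                right = sent[i + 1] if i < len(sent) - 1 else None
                contexts.append((left, right))
    return contexts


PARADIGMATIC_CORPUS = [tokenize(s) for s in [
    "Þá mælti Guðrún til sinnar vinkonu",
    "Þá mælti Guðrún til Gunnars",
    "Ok er þau vöknuðu, mælti hún til Högna",
    "Þá mælti Bikki til Randvés",
]]
SYNTAGMATIC_CORPUS = [tokenize(s) for s in [
    "Guðrún hét dóttir hans.",
    "Eitt sinn segir Guðrún meyjum sínum at hún má eigi glöð vera.",
    "Guðrún svarar:",
    '"Þar mun vera Guðrún Gjúkadóttir," segir hún.',
]]

print(paradigmatic_search(PARADIGMATIC_CORPUS, "mælti", "til"))
print(syntagmatic_search(SYNTAGMATIC_CORPUS, "Guðrún"))
```

Each returned filler or context can then be handed to the local dependency analysis; a production version would presumably match on lemmata or normalized forms rather than raw strings.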


Global searches have the advantage of offering multiple opportunities to the disambiguation tool to determine the grammatical nature of a form. Their downside is, however, that a corpus-based search may yield more than one possible solution. The most commonly applied strategy for decision-making is based on calculations of frequency or probability. One such way of deciding on a form which yields multiple solutions is to calculate the mean and variance of the contexts of a particular result. For example, given the above form Guðrún and its multiple contexts from the syntagmatic analysis, we would like to find out which of the contexts


[Empty] ___ hét

segir ___ meyjum

[Empty] ___ svarar

vera ___ Gjúkadóttir


occur relatively often in the corpus at roughly the same distance. To that end, we compute the sample variance

$$s^2 = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2$$

where $N$ is the number of times the context occurs, $x_i$ is the offset between the two contexts in occurrence $i$, and $\bar{x}$ is the sample mean of the offsets. The square root of the variance is the standard deviation of a given context; the smaller the variance, the more consistently a given context occurs at the same distance. In turn, this indicates that a context with low variance is more likely to yield the correct grammatical interpretation of a form. This calculation of variance will allow us to assign a score to each result. For each ambiguous form, these scores, the part of speech information and lemma can be automatically encoded in the XML tag. The definition of this element will be added to the Menota handbook.
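As a worked illustration of this score, the following Python sketch collects signed offsets between two forms and computes their sample mean and sample variance with the standard statistics module (which divides by N - 1). The choice of a five-token window, the reading of "offset" as the signed distance between two forms in the same sentence, and the function names are assumptions made for the example; the sentences are the saga excerpts quoted above, typed in with restored diacritics.

```python
import re
import statistics
from typing import List, Optional, Tuple


def tokenize(sentence: str) -> List[str]:
    """Split a sentence into word tokens, discarding punctuation."""
    return re.findall(r"\w+", sentence)


def offsets(corpus: List[List[str]], a: str, b: str,
            window: int = 5) -> List[int]:
    """Signed distances from each occurrence of a to each b within window."""
    found = []
    for sent in corpus:
        for i, w in enumerate(sent):
            if w != a:
                continue
            for j, v in enumerate(sent):
                if v == b and 0 < abs(j - i) <= window:
                    found.append(j - i)
    return found


def mean_and_variance(xs: List[int]) -> Optional[Tuple[float, float]]:
    """Sample mean and sample variance; None if fewer than two offsets."""
    if len(xs) < 2:
        return None
    return statistics.mean(xs), statistics.variance(xs)


CORPUS = [tokenize(s) for s in [
    "Þá mælti Guðrún til sinnar vinkonu",
    "Þá mælti Guðrún til Gunnars",
    "Ok er þau vöknuðu, mælti hún til Högna",
    "Þá mælti Bikki til Randvés",
    "Eitt sinn segir Guðrún meyjum sínum at hún má eigi glöð vera.",
    '"Þar mun vera Guðrún Gjúkadóttir," segir hún.',
]]

# 'til' always follows 'mælti' at distance 2: zero variance, a rigid context.
print(mean_and_variance(offsets(CORPUS, "mælti", "til")))
# 'Guðrún' occurs both after and before 'segir': a looser context.
print(mean_and_variance(offsets(CORPUS, "segir", "Guðrún")))
```

On this toy data the pair mælti ... til has variance 0, while segir ... Guðrún has offsets +1 and -2 and therefore a variance of 4.5, so the first context would receive the better score.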


The level of ambiguity in the above examples, and in Old Icelandic texts in general, ranges from very low (or nonexistent) to quite high. While the goal of the disambiguation program is not to provide absolute disambiguation (nor is it intended for automatic translation, although it can certainly assist in machine-assisted translation), it should give users of various backgrounds the opportunity to undertake sophisticated and nuanced searches of a large text corpus. Grammatical disambiguation is a multi-faceted linguistic and computational problem. In our opinion, it should be approached by a multi-tiered strategy of local, global, and probability-based solutions.


Expanding the underlying lexical set


A current limitation of the morphological analyzer is the fairly small lexical set of its corpus. Zoëga's subset of Cleasby-Vigfusson has been instrumental in our ability to develop the morphological analyzer but needs to be expanded in order to deal with the lexical diversity of the text corpus. Expanding the lexical set will also result in a refinement of the table of exceptions. Both developments will greatly improve the performance of the morphological analyzer in a real textual environment. Furthermore, expansion of these underlying lexical sets will greatly improve the accuracy of the disambiguation routines.


Initially, we intend to focus on incorporating all of the Cleasby-Vigfusson lexical data into the underlying database. Definitions from Cleasby-Vigfusson will also greatly enhance the usability of the English language lookup tool. As unexpected, rare or unusual forms arise in the saga texts (words not covered by Cleasby-Vigfusson), we will supplement the dataset with information from the ONP. In collaboration with researchers at the ONP, we have already harvested all of the headwords from that project, along with the minimal part-of-speech information currently in their database. As we encounter lemmata not in Cleasby-Vigfusson, we can input information from the non-digital ONP archive via a webform that we will develop specifically for this purpose.


Because of the architecture of our system, all normalized lexicon entries share the same structure regardless of their source document. The normalization process relies on a library of rule objects; each object contains the layout rules for a particular lexicon, thereby allowing the Normalizer module to correctly interpret lexicon entries. Currently there exists only one rule object that our Normalizer accesses, namely that for Zoëga's dictionary. To integrate the Cleasby and Vigfusson lexicon and harvested lemmata from the ONP (or any other Old Icelandic lexicon, for that matter), our team will create new rule objects and add them to the library. This system of rule objects will allow us to expand the underlying lexical database incrementally, while continuing to work on the more challenging tasks of disambiguation and orthographic normalization.
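A rule object of this kind might look roughly like the following Python sketch (illustrative only; the Normalizer itself is part of the Perl code base). It assumes that a Zoëga entry has the layout "headword, abbreviated grammatical information, gloss;", as in the barna-börn example given earlier. The third output field, 'E', is simply copied as a constant because its meaning is not explained in the text. The class and function names are hypothetical.

```python
import re
from typing import Dict


class ZoegaRule:
    """Hypothetical layout rule for entries such as
    'barna-börn, n. pl. grandchildren;'."""

    # headword , abbreviations ending in '.' , gloss , optional ';'
    pattern = re.compile(
        r"^(?P<head>[^,]+),\s*(?P<gram>(?:[a-z]+\.\s*)+)(?P<gloss>.+?);?\s*$"
    )

    def parse(self, raw: str) -> Dict[str, str]:
        match = self.pattern.match(raw.strip())
        if match is None:
            raise ValueError(f"entry does not fit the assumed layout: {raw!r}")
        head = match.group("head").strip()
        return {
            "key": head.replace("-", ""),   # lookup key without hyphens
            "headword": head,
            "flag": "E",                    # copied constant; meaning assumed unknown
            "grammar": match.group("gram").replace(".", "").strip(),
            "gloss": match.group("gloss").strip(),
        }


# The library of rule objects: one entry per source lexicon.
RULE_LIBRARY = {"zoega": ZoegaRule()}


def normalize(raw: str, source: str) -> str:
    """Normalize a raw lexicon entry using the rule object for its source."""
    fields = RULE_LIBRARY[source].parse(raw)
    return " | ".join(
        fields[name] for name in ("key", "headword", "flag", "grammar", "gloss")
    )


print(normalize("barna-börn, n. pl. grandchildren;", "zoega"))
```

Supporting another lexicon would then amount to adding one more object with its own parse method to RULE_LIBRARY, leaving normalize untouched.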


Orthographic change and time-lining


Many of the Old Icelandic texts available in digital form use different orthographic conventions. While some of these conventions are a matter of simple substitution, others are significantly more complex. Furthermore, diplomatic editions of manuscripts follow orthographic conventions in place during the time of writing. All of these orthographic differences need to be normalized for morphological analysis and disambiguation to take place. At the same time, significant information concerning language development exists in the orthographic conventions of a particular era.


We propose to develop a series of normalization routines that will allow any medieval Icelandic text to be normalized to a standard orthography. This standard orthography will be used as a REG field as defined by the MENOTA handbook for the XML markup of the text in question, allowing the original orthography to be accessed by the end user. The end user will also be able to toggle between texts to take advantage of the morphological analyzer and disambiguation routines within a digital library environment. A time-lining function will allow an end user to call up texts written with a particular orthographic convention, as well as the normalized version of that text. There are significant challenges associated with developing such time-lining protocols. Perhaps one of the most challenging is that archaic orthographic features tend to recur in later manuscripts. It may well be that we will need to develop a specific orthographic module for each individual manuscript; descriptions of the orthographic features of the document will be incorporated into the metadata describing the digital text, allowing it to function with both the normalizer and the time-lining functions. Significantly, the orthographic normalizer will allow us to use digital versions of Old Icelandic texts normalized to modern Icelandic orthography, by renormalizing these texts to Old Icelandic orthography.


We have begun describing the rules for the orthographic normalizer and believe the implementation of these rules will be relatively straightforward, given our implementation of the Normalizer module described above. This does not mean that the task is free of challenges. The differences between a diplomatic transcription of a manuscript, standardized Old Icelandic, and Modern Icelandic orthography can be seen, for example, in the following short text samples:


Text samples from Victors saga ok Blávus (Loth 1962)

Diplomatic text                 | Standardized Old Icelandic     | Modernized Old Icelandic
...kongr gerdjzt hliodr eirn    | ...kóngr gørðisk hljóðr einn   | ...kóngur gerðist hljóður einn
dag er þau Alba satu bði        | dag er þau Alba sátu bæði      | dag er þau Alba sátu bæði
samt ok tavlvdvzt vid...        | samt ok tǫluðusk við...        | samt og töluðust við...
veitzlunj vt endadri            | veizlunni út endaðri           | ...að veislunni út endaðri
uoru allir herrar ok            | váru allir herrar ok           | voru allir herrar og
haufdjngiar vt leyster          | hǫfðingjar út leystir          | höfðingjar út leystir
med agitum giofum...            | með ágætum gjǫfum...           | með ágætum gjöfum...

...sau þeir fostbrdur           | sá þeir fóstbrœðr              | þeir fóstbræður
at þar var allr sioR svartR     | at þar var allr sjór svartr    | að þar var allur sjór svartur
sem kolum wri saad...           | sem kolum væri sát...          | sem kolum væri sáð...


The rules we expect to develop fall into two main areas, phonology (vowels and consonants) and morphology. As we expand the range of the orthographic normalizer, rules will be added to account for incremental changes in orthography from the earliest writing up through the present (this latter category is of course only applicable for Old Icelandic texts that have been normalized in the digital realm to modern Icelandic spelling).
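To make the idea of rule-driven orthographic normalization concrete, the following Python sketch renormalizes modern Icelandic spelling towards standardized Old Icelandic using three deliberately naive substitution rules (og to ok, að to at, and removal of the epenthetic u in word-final -ur after a consonant). The rules are toy assumptions chosen to fit one line of the sample text; they are not the project's rule set, and they would over- or under-apply on real data.

```python
import re

# Vowels that must NOT precede '-ur' for the epenthesis rule to fire.
VOWELS = "aeiouyáéíóúýæö"

# Ordered list of (pattern, replacement) pairs; each is a toy assumption.
MODERN_TO_OLD_RULES = [
    (re.compile(r"\bog\b"), "ok"),                          # og -> ok
    (re.compile(r"\bað\b"), "at"),                          # að -> at
    (re.compile(rf"([^\W\d_{VOWELS}])ur\b"), r"\1r"),       # allur -> allr
]


def renormalize(text: str) -> str:
    """Apply every substitution rule in order to the whole text."""
    for pattern, replacement in MODERN_TO_OLD_RULES:
        text = pattern.sub(replacement, text)
    return text


print(renormalize("að þar var allur sjór svartur"))
print(renormalize("voru allir herrar og"))
```

The first call reproduces the standardized line from the sample; the second shows the limits of such a small rule set, since og is rewritten but voru is left unchanged where the standardized text has váru.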


Ongoing expansion of the text corpus and integration with other systems


The development of normalization routines will immediately allow us to expand the digital corpus on which the morphological analyzer, lookup tool, and disambiguation routines operate to include all extant digital editions of Old Icelandic texts. Collaboration with the University of Iceland (Arnamagnaean Institute), the University of Copenhagen (Arnamagnaean Institute) and the Ordbog over det norrøne prosasprog will greatly facilitate this process. A collaboration with Matthew Driscoll at the Arnamagnaean Institute in Copenhagen surrounding the ongoing digitization of diplomatic editions of the manuscripts that form the basis of all standard edition Old Icelandic texts further ensures that the corpus will not be limited solely to standardized texts, but rather will afford researchers the opportunity to work online with variant manuscript texts. The XML encoding of all these texts with normalized spelling, part of speech information (from the morphological analyzer) and disambiguation scores (from the disambiguation routines) will greatly enhance the ability of end users to carry out sophisticated searches and analyses of a significant component of the extant Old Icelandic corpus. It will also likely contribute to the eventual creation of a parsed corpus of Old Icelandic.


We will continue to work closely with the Perseus project to integrate the texts and the tools into the Perseus digital library project. We will also continue the development of our own Greenstone Digital Library site at UCLA, and will mirror this site at the University of Copenhagen. We will continue to explore ways in which to integrate the system and the tagged texts with developing systems so as to take advantage of the latest advances in textual analysis and visualization tools, and will also explore exporting the system as a SCORM learning object.


Porting to Old English


Porting our work to another early Germanic language will allow us to test the rules-based approach to automatic morphological analysis and our underlying architecture that separates the Target Language rules from the analyzer itself. At the same time, it will provide a quick and efficient way for the automatic morphosyntactic markup of Standard Edition Old English texts.


      We have chosen Old English as our test project for several reasons. Although Old Icelandic and Old English belong to different branches of the Germanic group of the Indo-European language family, their morphological systems are relatively similar to each other. Both languages share the division of nouns, adjectives and verbs into strong and weak, which is inherited from the Germanic proto-language. The stem classes of the various parts of speech are also essentially the same in both languages (see Krahe 1969; for OE in particular, see Campbell 1959).


      The Old English morphological analyzer will work primarily with the currently limited online Dictionary of Old English at the University of Toronto as the input for its Lexicon module. Of course, given the architecture of the system, any Old English lexicon can be attached once a rule object for that lexicon has been developed; we will make information on how to write a rule object readily available on our project site so that interested parties can write their own and import their lexica. We expect that the underlying lexical set can be expanded to include the online edition of Bosworth-Toller (1898) as it becomes available.


      We consider the porting of the morphological analyzer to be an important test of the scalability of our architecture to other Germanic languages. Old English is complex, yet sufficiently related to Old Icelandic that developing a Target Language module for the morphological analyzer should proceed smoothly. Indeed, similar to the Natural Language component of the analyzer, the ability to handle multiple target languages will be accomplished by adding language objects to the library of target languages. For a specific request, the morphological analyzer accesses the appropriate language object to apply the necessary phonological and morphological rules. We will limit the scope of our Old English Target Language module to the West Saxon dialect (the standard Old English dialect), and specifically to nouns and verbs in the first instance. This adaptation of the underlying architecture of our morphological analyzer to Old English will help substantiate the applicability of our approach to morphological analysis not only for Germanic languages in general, but potentially for other Indo-European languages as well.
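The library of language objects could be organized along the following lines. This Python sketch is an assumption-laden illustration rather than a description of the Perl code: the language codes 'non' and 'ang' are the ISO 639 codes for Old Norse and Old English, the single Old Icelandic rule is a simplified stand-in for consonant excision, and the Old English object deliberately carries no rules. All names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A rule takes (stem, ending) and returns a possibly modified ending.
EndingRule = Callable[[str, str], str]


@dataclass
class TargetLanguage:
    """A language object: a name plus its ordered list of rules."""
    name: str
    rules: List[EndingRule] = field(default_factory=list)


LIBRARY: Dict[str, TargetLanguage] = {}


def register(code: str, language: TargetLanguage) -> None:
    """Add a language object to the library under the given code."""
    LIBRARY[code] = language


def drop_repeated_consonant(stem: str, ending: str) -> str:
    """Simplified stand-in rule: avoid a triple consonant at the boundary."""
    if len(stem) >= 2 and ending and stem[-1] == stem[-2] == ending[0]:
        return ending[1:]
    return ending


def inflect(code: str, stem: str, ending: str) -> str:
    """Language-independent engine: look up the language object and
    apply its rules before joining stem and ending."""
    language = LIBRARY[code]  # raises KeyError for an unregistered language
    for rule in language.rules:
        ending = rule(stem, ending)
    return stem + ending


register("non", TargetLanguage("Old Icelandic", [drop_repeated_consonant]))
register("ang", TargetLanguage("Old English (West Saxon)"))

print(inflect("non", "þurr", "r"))
print(inflect("ang", "stān", "as"))
```

Porting to a further language would then consist of writing its rules and registering one more object, with inflect itself unchanged.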



Work Plan


We propose a three-year horizon for the development and implementation of our proposed project.


In the first year:


Assemble and describe rules for orthographic normalization. (Vijunas, months 1-3)

Complete digitization of Cleasby-Vigfusson, and ensure that the lexical database conforms with the requirements of our lexicon module; these materials will be ported to Perseus to expand the reach of the lookup tool for Old Icelandic in their system (Tangherlini and graduate student researchers (GSR), months 1-12)

Develop a system for the incorporation of lemmata from the ONP into the lexical database (Tangherlini, months 1-2)

Optimize the current morphological analyzer for speedier lookup; and refine several routines that occasionally do not return the proper output (Urban, months 1-6)

Devise the second-generation orthographic normalization module, and implement the first set of orthographic normalization routines (Urban and PA, months 7-9)

Draw up rules for the most common situations in which ambiguous forms arise (Vijunas, months 3-9)

Develop the alpha version of the disambiguator, including scoring of ambiguous forms in the Legendary sagas. (Urban and PA, months 10-12)

Develop rules for the Old English lexicon normalizer and implement them (Vijunas, months 10-12)





In the second year:


Develop and implement routines for disambiguation of the most commonly occurring situations based on a computer-driven analysis of the Legendary sagas (Vijunas and Urban, months 13-15)

Refine and optimize our proposed algorithms for disambiguation (Urban and PA, months 13-15)

Incorporate The Family Sagas (back-normalized to Old Icelandic) into the underlying text corpus to increase the accuracy of the disambiguation routines and scoring (Tangherlini, months 13-15)

Incorporate our disambiguated texts into a Greenstone Digital Library implementation at UCLA (Tangherlini and GSR, months 15-18)

Continue to analyze and describe ambiguity in Old Icelandic (Vijunas, months 16-20)

Refine the disambiguation routines (Urban and PA, months 16-20)

Identify all ambiguous forms for which the disambiguator cannot provide adequate scoring; explore if routines can be developed for these forms (Vijunas and GSR, months 16-24)

Develop rules for West Saxon verbs and nouns (Vijunas, months 16-24)

Incorporate these rules into a test Target Language module for Old English (Urban and PA, months 21-24)

In the third year:

Expand the orthographical normalizer to account for orthographic change from the 11th to the 15th centuries (GSR, Vijunas and PA, months 25-28)

Refine disambiguation routines and scoring (GSR, months 25-36; Urban and PA, months 25-31)

Refine Old English (West Saxon) analyzer (Urban, Vijunas and PA, months 25-31)

Release Beta-version of the disambiguator, and publish all parameters for rule sets for the adaptation of the system to other Germanic languages (Tangherlini, months 34-36)




     Among the most significant outcomes of the project will be a well-integrated series of tools that provide an accurate morphological analysis that accounts for the phonological and morphological complexity of Old Icelandic; an English language lookup based on a nearly comprehensive lexical set for Old Icelandic; orthographic normalization routines that allow for searches, analysis and visualization on a wide range of Old Icelandic texts, irrespective of the orthographic conventions used; and the disambiguation of forms in context, allowing for more accurate textual analysis (including pattern matching, clustering, and keyword extraction) as a first step toward a parsed corpus of Old Icelandic.


Our work will make a significant corpus of morpho-syntactically marked texts more accessible for linguistic and comparative research by researchers, students and the broader public who may have little understanding of the complexity of Old Icelandic or other ancient Scandinavian languages. Coupled to the expanded English-language look-up tool, the morphological analyzer/disambiguator will allow scholars with little background in early Scandinavian languages access to this rich, early prose narrative tradition, and allow them to answer questions of significant complexity. The system can also function as an integral component in the teaching of Old Icelandic. Our extension of this work to Old English will provide similar benefits for the community of scholars, students and members of the general public interested in materials written in that language. Adapting the underlying program to work with other ancient Germanic languages will pave the way for the development of a series of morphological analyzers for Germanic languages in general, as well as potentially allow for cross-corpora comparisons of specific phenomena.


     By integrating the analyzer and the disambiguation extensions, along with the lookup tool, into established digital library systems, we take advantage of statistical and visualization tools being developed at other institutions, such as those included in Wordstat and Xaira; those developed at Imperial College as part of CHLT; and those developed as part of the Perseus Project. Tools that make use of texts marked for morpho-syntactical detail allow for highly accurate searches and comparisons within and across corpora. Such searches and analysis can lead to new understandings of relationships between texts, as well as the discovery of hitherto unrecognized aspects of the historical development of these languages.


      Finally, our morphological analyzer and the disambiguation extensions will be shared in the open-source community, and will be cognizant of the APIs for various shared learning environments. We will explore the packaging of the system as a SCORM learning object.