Workpackage 2: Computational Linguistics

 

Executive Summary

 

WP2:  Second Year Accomplishments and Third Year Goals

 

In the past year, WP2 has focused its labor on the development of a multi-lingual information retrieval tool.  This tool has two primary components:

 

 1)  a facility to extract translation equivalents from our available digital corpora

 2) a user interface allowing users to construct their queries for a traditional mono-lingual search engine.

 

We created the core data for the query translation system using a modular program that automatically extracts translation equivalents from any SGML or XML dictionary tagged in accordance with the guidelines of the Text Encoding Initiative (TEI) or any other user-defined DTD. After entering query terms in English, the user is presented with an interface that supplies enough detail to construct the best translation of each word for their needs. This process can range from the simple elimination of obvious ambiguities and mistakes to a careful consideration of every term. The interface provides a list of translation equivalents for the word or words that the user entered, along with an automatically abridged English definition of each word, a link to its full definition, a list of authors who use the word, and data about the frequency of each word in works by the selected authors.

We also experimented with automatic extraction of translation equivalents from the parallel Greek and Latin corpora of the Perseus Digital Library, and met with only limited success. The methodologies we used were based on work done with parallel corpora in which all documents were of comparable size. Because of the heterogeneous nature of our corpora and the varying sizes of our available text chunks, our results contained far too many false positives to be of use to the average user. We were more successful in implementing a query expansion routine that suggests words that were not in the user's original query by automatically extracting related definitions from the TEI-conformant lexica.

This work was integrated with the results of WP1. After a user constructs a query with the WP2 multi-lingual search tool, the query can be passed to the mono-lingual visualization tool of WP1 for further study and refinement.

 

Our second year also saw the continuation of our efforts to capture feedback about the word study tool and re-integrate it into the database. This took several forms, including further editing of texts to achieve better extraction of parallel Greek and English text segments. Our work here overlapped with the alignment work we attempted for the multi-lingual search tool. Because we did not need to subject these text segments to further statistical processing, our insertion of milestones was more successful for this purpose. We also worked to reintegrate other forms of existing knowledge into our database by mapping the citation scheme of poetic works in older reference works, such as the Liddell-Scott-Jones Greek-English Lexicon, to the more current and widely used standards established by the Thesaurus Linguae Graecae. Finally, we continued work on the document architecture for the lexicon, with a particular focus on transformations that will be appropriate for both the print edition and integration into the digital library system.

 

In our third year, we will turn our attention to our syntactic parsing toolbox.  In this phase, we will try to develop programs to discover selectional preferences and subcategorization frames for Greek verbs.  Our first step will be to develop an architecture that allows for detailed statistical analysis of sentences in Greek.  Our initial hypothesis is that we will be able to refine the interface that we developed for the vocabulary profile tools and then turn to the statistical analysis. 

 


 

 

Quarterly Progress Reports for Year 2

 

 

Cultural Heritage Language Technologies

 IST 2001-32745

 

June 1 - August 31, 2003

 

Workpackage 2: Word Profile Tools

 

University of Missouri, Kansas City

Faculty of Classics, Cambridge University

 

Bruce Fraser

Jeffrey A. Rydberg-Cox

A.A. Thompson

 

 

 

1. Summary of key indicators of project progress

 

1.1  Overview of objectives

 

The practical tools under development in Workpackage 2 can be divided into three groups:

 

1) Multi-lingual retrieval facilities for digital library systems (DLSs).

2) Vocabulary profile tools for texts and corpora (in DLSs).

3) Syntactic parsing tools for Greek texts.

 

1.2  Overall assessment of main milestones, results, or deliverables

 

Our first year focused on the development of the vocabulary profile tools and the integration of user feedback. Work in this area has continued in this period, with particular attention to problems of document architecture and to establishing unique identifiers for documents in the system. At the same time, we began work on the multi-lingual information retrieval facilities. Our next deliverable in this area is a multi-lingual thesaurus automatically extracted from our parallel corpora. Our initial focus has been on data structures for aligning our parallel corpora and on the most appropriate algorithms for our use.

 

 

2. Work Progress Overview

 

            2.1 Specific objectives for the reporting period

 

We have had three specific objectives for this reporting period. 

 

1.  To continue working on the document architecture for the lexicon with a particular focus on transformations that will be appropriate for both the print edition and integration into the digital library system. 

 

2.  To continue developing a mechanism that will allow for better integration of pre-existing expert knowledge into the word profile tool, with a particular emphasis on mapping the citation scheme of poetic works in older reference works, such as the Liddell-Scott-Jones Greek-English Lexicon, to the more current and widely used standards established by the Thesaurus Linguae Graecae.

 

3.  Preliminary work for D2.3: Tool to Extract Corpus Based Thesauri from Corpora.

 

2.2 Achievements

 

Document Architecture:

 

The architecture of the Greek lexicon needed to have a design which is suitable for both the print edition and the digitized version. The development of a dedicated document structure or Document Type Definition (DTD) was described in the Progress Report for February-March 2003. However, the individual documents also need to be linked into a unified system which allows for a wide variety of textual interrogation, and they also require suitable XSL transformations for display in the print and digital versions. Document linking is described briefly first.

 

Linking documents:

 We initially contemplated the 'XLink' system, which uses a single file which contains all the links for the entire lexicon, in the form:

 

<interlink>

<fromPoint href="Filename.xml#ID_OfSomeElementOrAnchor" />

<toPoint href="OtherFilename.xml#ID_OfThingToLinkTo" />

<go />

</interlink>

 

However, we decided that it was possible to link documents in a more straightforward structure, through direct HREF linking between the documents, with all external linking achieved with the 'headword' of an entry as the target. This created a much simpler architecture.

 

Each document may also contain a number of document-internal links, using attributes to the elements <RefFm>, <RefVL>, and <Form>. Of these, <Form> points to the HL within a single entry. <RefVL> and <RefFm> occur only within cross-reference entries, and refer to a variant or other form of the headword, within whichever entry is the target of the <Ref> element.

 

The <Ref> element always refers to a headword, and its attribute carries the only HREF link which can point to a target headword external to the  document. Every document in the lexicon carries a unique identifier, and the headwords that appear in it carry a 'name' attribute (applied during production). We anticipate that this structure, in conjunction with the finely-structured DTD, will support a wide variety of textual interrogation.

 

XSL transformations

 

A high-quality display is desirable, both for feedback during the authoring process and for reader use, since the lexicon is so densely formatted, using Greek, italic, bold and bracketed text. We especially wish to avoid the almost unreadable texts of some earlier classical-language dictionaries. We are therefore using XSL-FO transformations, which are capable of generating print-quality output, with precise control of text detail as well as whitespace, indents, and other aspects of the overall appearance of the document. The transformations are still under development. The output achieved so far is exemplified by the PDF file included here as an appendix (see Annex 2). An extract from the coding is given here:

 

<!-- ***** fo:page-sequence mode ***** -->

<!-- Creates fo:page-sequence elements -->

 

<xsl:template match="lex:lexicon" mode="fo:page-sequence">

  <fo:page-sequence master-reference="lexicon-page-sequence">

    <xsl:apply-templates select="." mode="fo:title" />

    <xsl:apply-templates select="." mode="fo:static-content" />

    <xsl:apply-templates select="." mode="fo:flow" />

  </fo:page-sequence>

</xsl:template>

 

<!-- ***** fo:title mode ***** -->

<!-- Creates a fo:title element -->

 

<xsl:template match="lex:lexicon" mode="fo:title">

  <fo:title>

    <xsl:value-of select="lex:header/lex:file/lex:title" />

  </fo:title>

</xsl:template>

 

<!-- ***** fo:static-content mode ***** -->

<!-- Creates fo:static-content elements -->

 

<xsl:template match="lex:lexicon" mode="fo:static-content">

  <fo:static-content flow-name="running-head-recto">

    <xsl:apply-templates select="lex:header" mode="fo:static-content" />

  </fo:static-content>

  <fo:static-content flow-name="running-head-verso">

    <xsl:apply-templates select="lex:header" mode="fo:static-content">

      <xsl:with-param name="side" select="'left'" />

    </xsl:apply-templates>

  </fo:static-content>

  <!--

  <fo:static-content flow-name="footer">

    <fo:block>Footer</fo:block>

  </fo:static-content>

  -->

</xsl:template>

 

<xsl:template match="lex:header" mode="fo:static-content">

  <xsl:param name="side" select="'right'" />

  <fo:block xsl:use-attribute-sets="lex:normal-font lex:italic-font"

            font-size="9pt" text-align="{$side}"

            space-before="{$fo:region-before-extent} - 11pt"

            border-bottom="0.5pt solid black">

    <xsl:value-of select="lex:file/lex:title" />

    <xsl:text> </xsl:text>

    <xsl:value-of select="lex:file/lex:date" />

  </fo:block>

</xsl:template>

 

<!-- ***** fo:flow mode ***** -->

<!-- Creates a fo:flow element -->

 

<xsl:template match="lex:lexicon" mode="fo:flow">

  <fo:flow flow-name="xsl-region-body"

           font-size="8.5pt" line-height="10pt"

           text-align="start">

    <xsl:apply-templates select="lex:text" mode="fo:block" />

  </fo:flow>

</xsl:template>

 

<!--

<xsl:template match="lex:AdvUsg | lex:Alt | lex:Ann | lex:Au |

                     lex:Case | lex:Cllc | lex:Cmpl | lex:Ctxt |

                     lex:Def | lex:Deg | lex:DInfl | lex:DL |

                     lex:Ed | lex:Encyc | lex:Envr | lex:Ety | lex:Extra |

                     lex:Form | lex:Func |

                     lex:GLbl | lex:Gntv | lex:Gr |

                     lex:HL |

                     lex:Indic | lex:Infl | lex:ital |

                     lex:Lbl | lex:LblR |

                     lex:Md |

                     lex:Obj |

                     lex:Prnth | lex:PrPhr | lex:PrpUsg | lex:Prsd | lex:PS |

                     lex:QualN |

                     lex:RefFm |

                     lex:Spec | lex:Subj | lex:Summ |

                     lex:title | lex:Tns | lex:Tr | lex:TrPhr |

                     lex:Usg |

                     lex:Vc | lex:VInfl | lex:VL |

                     lex:Wk |

                     lex:XR" mode="fo:inline">

  <xsl:text> </xsl:text>

  <xsl:apply-imports />

</xsl:template>

 

<xsl:template match="lex:Lbl" mode="fo:inline">

  <xsl:text> </xsl:text>

  <xsl:apply-imports />

  <xsl:text> </xsl:text>

</xsl:template>

-->

<!--

<xsl:template match="*[not(preceding-sibling::node())] |

                     lex:hyph | lex:Hm" mode="fo:inline">

  <xsl:apply-imports />

</xsl:template>

 

<xsl:template match="*" mode="fo:inline">

  <xsl:text> </xsl:text>

  <xsl:apply-imports />

</xsl:template> -->

 

(See also Annex 2.)

 

Integration of Expert Knowledge:

 

The development of the word profile tool has faced two major interrelated problems in the integration of primary textual data. The first is that the corpus is not static: new textual information is continually being discovered, especially in the Oxyrhynchus papyri, which have been published regularly since 1898, with approximately another 40 volumes due to appear. The second is that Ancient Greek poetic texts have been edited using multiple citation systems, many of which were devised in the nineteenth century.

 

A two-part (binary) search procedure was designed to overcome both problems. To address the first, the Perseus morphological analyzer can search throughout all relevant textual databases in the DLS, including newly-digitized texts as they become available. This will be particularly useful for Hellenistic (post-classical) Greek texts, where many important new discoveries are being made.

 

The second problem, that of multiple citation systems, is especially severe for early lyric poets such as Sappho, whose works are preserved mostly in a fragmentary state. Therefore, in addition to the equivalence tables described in the Progress Report for June-November 2002, we have built tables for the poets, which will be integrated into the search software. The morphological analyzer can then conduct separate searches which are restricted to passages cited in reference works such as the Liddell and Scott Greek Lexicon (LSJ), and match the old citations to the digitized texts. Outputs from the two types of searches can then be used for scholarly research, in tandem or separately.
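The lookup that such an equivalence table supports can be sketched as follows. This is only an illustration, not the project's search software: the function and variable names are hypothetical, the rows are copied from the Carmina popularia extract in Annex 1, and reading the last column of that extract as the PMG fragment number is an assumption.

```python
from typing import Dict, List, Optional, Tuple

# Rows copied from the Annex 1 extract for Carmina popularia ("Carm.Pop.").
# Assumption: the last column of the extract is the PMG fragment number.
CARM_POP_ROWS: List[Tuple[str, str]] = [
    ("1", "849"),
    ("2", "880"),
    ("4", "872"),
    ("7-8", "851"),
    ("22A", "876"),
    ("22B", "861"),
]


def expand_key(lsj_number: str) -> List[str]:
    """Expand a numeric range such as '7-8' into ['7', '8']; keep other keys as they are."""
    if "-" in lsj_number:
        start, end = lsj_number.split("-")
        return [str(n) for n in range(int(start), int(end) + 1)]
    return [lsj_number]


def build_table(author: str, rows: List[Tuple[str, str]]) -> Dict[Tuple[str, str], str]:
    """Index the rows by (author abbreviation, old fragment number)."""
    table: Dict[Tuple[str, str], str] = {}
    for lsj_number, pmg_number in rows:
        for key in expand_key(lsj_number):
            table[(author, key)] = pmg_number
    return table


EQUIVALENCE = build_table("Carm.Pop.", CARM_POP_ROWS)


def to_modern(author: str, fragment: str) -> Optional[str]:
    """Return the modern fragment number for an old citation, or None if unmapped."""
    return EQUIVALENCE.get((author, fragment))
```

A call such as to_modern("Carm.Pop.", "8") resolves the old citation through the "7-8" row; citations that are absent from the table return None, so they can be left for manual checking.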

 

The equivalence tables will also have a more general reference use for classical literary and linguistic studies, as they will enable readers of LSJ and other reference works to identify passages in the modern editions. They will therefore also be published in print form.

 

An extract from the introduction to the human-readable version follows. See also a sample from the table, included as Annex 1.

 

[Extract from Introduction begins]

When using the Greek-English Lexicon of Liddell-Scott-Jones (LSJ), readers face the problem that many citations of the early Greek poets are to editions which are out of print and have been superseded by more recent works which give different numbers to the fragments. Although their comparationes numerorum provide helpful 'back bearings' to the earlier editions, they do not constitute a fast method of linking from citations in LSJ to the texts. In addition, users of the Thesaurus linguae Graecae (TLG) CD-ROM may have no access to them, and citations in Montanari and the DGE cannot always be matched to LSJ. The authors and works covered are summarized below, grouped approximately by genre.

 

 Lyric and iambic poets

 

Mappings are given for Alcaeus, Alcman, Anacreon, Archilochus, Bacchylides, Callimachus (Aet., Epigr., Hec., Iambi, fragments), Carmina popularia, Corinna, Hipponax, Ibycus, Ion, Lyrica adespota (in Page PMG listed as Fragmenta adespota), Philoxenus, Pindar (Paeanes, Parthenia, Dithyrambi), Praxilla, Sappho, Scolia (Carmina convivalia in PMG), Simonides, Stesichorus, Timocreon, and Timotheus.

 

Epigrams

 

Epigrams by lyric and iambic poets are included in their listings. Citations from the Anthology (AP, APl., and App.Anth.) retain the same numbering in most modern editions, apart from the collections of Gow & Page, whose indexes are cited. 

 

 

Bucolic and elegiac poets

 

Poets are not included if their early numbering is retained in modern editions. These authors include: Callinus, Demodocus, Mimnermus, Moschus, Pratinas, Semonides, Solon, Theocritus, and Tyrtaeus. However, the fragments of Bion are mapped, and the division of Theognis into Books 1 and 2 is given.

 

Epic fragments:

 

Citations of epic fragments in old editions are mostly from Allen, and sometimes from Kinkel. Mappings from both are given for Cypria, Epigoni, Il.Parv., Il.Pers., Nosti, and Titanomachia. For Hesiodic fragments, readers are directed to the concordance in Merkelbach-West.

Comic fragments

 

While most fragments of Aristophanes and Menander have the same numbering in old and new editions, much new material has been discovered, and fragments have been extensively renumbered. References are given to Kassel-Austin's PCG III.2 (for Aristophanes) and VI.2 (Menander). For Menander, mappings are given for line numbers of named plays, and for fragments which appear in Sandbach.

 

Philosophical fragments

 

Old editions cite from Diels Vorsokr. or PPF. As the same numbering is retained in Diels & Kranz and KRS, it is not given here. The editions are cited in the bibliography.

 

Tragic fragments

 

Citations of Aeschylus and Sophocles have the same numbering in most editions, so these are cited, and mappings are given to Diggle TGFS. Mappings are also given for citations of Aeschylus from Weir Smyth AJP, and, for Euripides, from Arnim to Page Select Papyri, Bond, and Diggle Phaeth.

 

[Extract from Introduction ends. See also Annex 1].

 

 

Preliminary Work for D2.3: Tool to Extract Corpus Based Thesauri from Corpora

 

Work on Deliverable 2.3, a tool to extract a corpus-based thesaurus from our parallel corpus of Greek and Latin texts, focused on two areas. First, we looked at document architecture to allow for more precise alignment of texts. The Perseus Text Display system can display parallel segments of Greek and Latin texts, but the segments are very coarse-grained. The only map points available are those defined by the <div> or <milestone> tags and declared in the <refsdecl> tag of the TEI header. While this mechanism is appropriate for works such as Greek rhetoric, where the standard citation unit is usually no more than a paragraph or two, it is less appropriate for poetry and drama, where milestones might be 200 or more lines apart and the div structure might present entire scenes from a play.

We have therefore developed a system for automatic text alignment that takes advantage of a facility in the Perseus text display system for obtaining a precise citation, including a line number, for the beginning of any particular sentence. For example, a display chunk of book 1 of Homer's Iliad in our system will offer parallel translations of lines 33 to 65, but it is possible to use the byte offset in the XML file to discover that the sentence ennêmar men ana straton ôicheto kêla theoio begins at line 53. Our approach for texts structured with line numbers like this one is therefore to get the citation information for every sentence in both the Greek and English versions of the text, round the line number down to the nearest 10, and then use that rounded citation as an alignment point for chunks of text. This data is stored in a SQL database with the following structure:

Attribute |  Type   | Modifier

-----------+---------+----------

 sennum    | integer |

 docid     | text    |

 tail      | text    |

 lang      | text    |

 senid     | text    |

 senlen    | text    |

 toplevel  | text    |

 cit       | text    |

 dcit      | text    |

 

Here, dcit is the rounded citation for each sentence. We then select all of the sentences from the Greek and English versions of the text that share a dcit value and use them as the basis for calculating possible translation equivalents.
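The rounding and grouping step can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's code: the function names are hypothetical, each citation is reduced to a bare integer line number (the stored cit and dcit values are text and presumably also carry book or work information), and the sentence texts in the usage note are placeholders.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def rounded_citation(line_number: int, width: int = 10) -> int:
    """Round a line number down to the nearest multiple of `width` (53 -> 50)."""
    return (line_number // width) * width


def align_by_dcit(
    sentences: Iterable[Tuple[str, int, str]], width: int = 10
) -> Dict[int, Dict[str, List[str]]]:
    """Group (lang, line_number, text) records by rounded citation.

    Only alignment points attested in more than one language are kept,
    since a chunk with a single language offers nothing to align.
    """
    chunks: Dict[int, Dict[str, List[str]]] = defaultdict(lambda: defaultdict(list))
    for lang, line_number, text in sentences:
        chunks[rounded_citation(line_number, width)][lang].append(text)
    return {
        dcit: dict(by_lang)
        for dcit, by_lang in chunks.items()
        if len(by_lang) > 1
    }
```

For instance, passing the records ("greek", 53, "Greek sentence A"), ("english", 55, "English sentence A") and ("greek", 61, "Greek sentence B") yields one alignment point, 50, holding the first two sentences; the sentence at line 61 rounds to 60 and is dropped because it has no English counterpart there.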

The second portion of this work has focused on evaluating approaches that will be successful for texts written in Greek, Latin, and Old Norse. We have focused our investigations on three different statistics: a chi-squared test, a t-score, and a mutual information score. In our initial investigations, the chi-squared test appears the most promising, since mutual information scores are highly sensitive to variation in words that occur with relatively low frequencies. Similarly, t-scores assume that the probabilities of words occurring together are normally distributed, and Zipf's law shows that this assumption does not hold. At this point, work is proceeding with the chi-squared test as we develop the multi-lingual thesaurus tool.
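For reference, the three statistics can be computed from a 2x2 table of co-occurrence counts over aligned chunks, as in the sketch below. The formulas are the common textbook forms (Pearson's chi-squared for a 2x2 table, pointwise mutual information, and the collocation t-score that approximates the variance by the observed count); whether the project used exactly these variants is an assumption, and the function and parameter names are hypothetical.

```python
import math
from typing import Dict


def association_scores(
    both: int, source_only: int, target_only: int, neither: int
) -> Dict[str, float]:
    """Score a candidate translation pair from counts of aligned chunks.

    both        -- chunks containing the source word and the target word
    source_only -- chunks containing only the source word
    target_only -- chunks containing only the target word
    neither     -- chunks containing neither word
    Assumes at least one chunk in total.
    """
    a, b, c, d = both, source_only, target_only, neither
    n = a + b + c + d
    source_total, target_total = a + b, a + c

    # Pearson's chi-squared for a 2x2 contingency table.
    denominator = source_total * (c + d) * target_total * (b + d)
    chi2 = n * (a * d - b * c) ** 2 / denominator if denominator else 0.0

    # Co-occurrence count expected if the two words were independent.
    expected = source_total * target_total / n

    # Pointwise mutual information in bits; undefined without co-occurrence.
    mi = math.log2(a / expected) if a and expected else float("-inf")

    # t-score with the variance approximated by the observed count.
    t = (a - expected) / math.sqrt(a) if a else 0.0

    return {"chi2": chi2, "mi": mi, "t": t}
```

With the counts (8, 2, 2, 88), for example, the expected co-occurrence count under independence is 1, so the mutual information is log2(8) = 3 bits. Because the observed count is divided by a small expected count, pairs of rare words can reach very high mutual information on the strength of one or two co-occurrences, which is the sensitivity to low frequencies noted above.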

                        2.2.2 Progress of Workpackage/Tasks

 

We are on track to deliver D2.3 on time.

 

 

                        2.3.2 Work planned for next reporting period

 

Continued work on multi-lingual information retrieval tool and document architecture issues.  Completion of the citation scheme map and its integration into the word study tool.

 


 

            3.1 Co-operation within the consortium, including project meetings

 

Project meeting in Cambridge between the three project members JRC, BLF and AAT, 9-10 June, 2003.

 

Consortium liaison meeting in London, with representatives of all participating institutions, 12 June, 2003.

 

            3.2 Participation in workshops, conferences, publications

 

PUBLICATIONS:

"Automatic Disambiguation of Latin Abbreviations in Early Modern Texts for Humanities Digital Libraries" in Proceedings of the 2003 Joint Conference on Digital Libraries

 

"Towards a Cultural Heritage Digital Library" (with members of the Perseus Project) in Proceedings of the 2003 Joint Conference on Digital Libraries

 

CONFERENCES:

Joint Conference on Digital Libraries, Houston, Texas, May 28 - June 2, 2003.

 


Annex 1: Equivalence Table for poetic texts (extract):

 


Carm.Pop. = CARMINA POPULARIA  TLG 0295, 001

(Bergk III pp.654-88 to PMG pp.449-470; GL V pp.232-269.)

 

LSJ                  PMG

 

1                      3          849

2                      34        880

3          PMG, Fr.adesp. 37 = 955

4                      26        872

5                      33        879

6                      25        871

7-8                   5          851

9                      31        877

10                    16        862

11                    33        879

12                    14        860

13                    1          847

14                    17        863

15                    20        866

16                    19        865

17                    18        864

18                    24        870

19                    6          852

20                    30        876

21                    30        876

22A                  30        876

22B                  15        861

23                    22        868

24                    4          850

25                    35        881

26                    13        859

27                    7          853

28        IEG II p.11, Adesp.eleg. 17

            (TLG 0234, 001 Elegiaca adespota)

29        Ath. 10.455D (83, 2-3)

30        Tryphon p.193, 18

31        Ath. 10.453B (78, 22)

32        Ath. 10.453B (78, 23)

33        Ath. 10.455D (83, 8)

34        IEG II p.93, Panarces (a)

35        Plu. Quom.adul. 54 B 6

36-38   Ath. 14.648F (60, 10-20)

 

 

 


39                    28        874

40        IEG II p.8, Adesp.eleg. 7

(TLG 0234, 001 Elegiaca adespota)

41                    2          848

42                    36        882

43                    23        869

44                    27        873

45                    21        867

46-47   Coll.Alex. pp.173, 138

 

Corinn. = CORINNA Lyr.

TLG 0294, 001

(Bergk III pp.543-53 to PMG pp.325-45; GL IV pp.18-69.)

 

LSJ citations marked "Corinn.Supp." are given separately.

 

LSJ                  PMG

 

1                      5                      658

2                      9                      662

3                      20                    673

4                      10                    663

5                      8                      661

6                      6                      659

7                      3