Workpackage 2: Computational Linguistics
Executive Summary
WP2: Second Year Accomplishments and Third Year Goals
In the past year, WP2 has focused its labor on the development of a multi-lingual information retrieval tool. This tool has two primary components:
1) a facility to extract translation equivalents from our available digital corpora
2) a user interface allowing users to construct their queries for a traditional mono-lingual search engine.
We created the core data for the query translation system using a modular program that automatically extracts translation equivalents from any SGML or XML dictionary tagged in accordance with the guidelines of the Text Encoding Initiative (TEI) or any other user-defined DTD. After entering query terms in English, the user is presented with an interface that provides the detailed information needed to construct the best translation of each word for their purposes. This process can range from the simple elimination of obvious ambiguities and mistakes to a careful consideration of every term. The interface provides a list of translation equivalents for the word or words that the user entered, along with an automatically abridged English definition of each word, a link to its full definition, a list of authors who use it, and data about its frequency in works by the selected authors.
We also experimented with automatic extraction of translation equivalents from the parallel Greek and Latin corpora of the Perseus Digital Library, but met with only limited success. The methodologies we used were based on work done with parallel corpora where all documents were of comparable size. Because of the heterogeneous nature of our corpora and the varying sizes of our available text chunks, our results contained far too many false positives to be of any use to the average user. We were more successful in implementing a query expansion routine that suggests words that were not in the user's original query by automatically extracting related definitions from the TEI-conformant lexica.
This work was integrated with the results of WP1. After a user constructs a query with the WP2 multi-lingual search tool, it can be passed to the mono-lingual visualization tool of WP1 for further study and refinement.
Our second year also saw the continuation of our efforts to capture feedback about the word study tool and re-integrate it into the database. This took several forms, including further editing of texts to achieve better extraction of parallel Greek and English text segments. Our work here overlapped with the alignment work we attempted for the multi-lingual search tool. Because we did not need to subject these text segments to further statistical processing, our insertion of milestones was more successful for this purpose. We also worked to reintegrate other forms of existing knowledge into our database by mapping the citation scheme of poetic works in older reference works such as the Liddell-Scott-Jones Greek-English Lexicon to the more current and widely used standards established by the Thesaurus Linguae Graecae. Finally, we continued work on the document architecture for the lexicon, with a particular focus on transformations that will be appropriate for both the print edition and integration into the digital library system.
In our third year, we will turn our attention to our syntactic parsing toolbox. In this phase, we will try to develop programs to discover selectional preferences and subcategorization frames for Greek verbs. Our first step will be to develop an architecture that allows for detailed statistical analysis of sentences in Greek. Our initial hypothesis is that we will be able to refine the interface that we developed for the vocabulary profile tools and then turn to the statistical analysis.
Quarterly Progress Reports for Year 2
Cultural Heritage Language Technologies
IST 2001-32745
June 1 - August 31, 2003
Workpackage 2: Word Profile Tools
University of Missouri, Kansas City
Faculty of Classics, Cambridge University
Bruce Fraser
Jeffrey A. Rydberg-Cox
A.A. Thompson
1. Summary of key indicators of project progress
1.1 Overview of objectives
The practical tools under development in Workpackage 2 can be divided into three groups:
1) Multi-lingual retrieval facilities for digital library systems (DLSs).
2) Vocabulary profile tools for texts and corpora (in DLSs).
3) Syntactic parsing tools for Greek texts.
1.2 Overall assessment of main milestones, results, or deliverables
Our first year was focused on the development of the vocabulary profile tools and the integration of user feedback. Work in this area has continued in this period, with a focus on problems of document architecture and on establishing unique identifiers for documents in the system. At the same time, we began work on the multi-lingual information retrieval facilities. Our next deliverable in this area is a multi-lingual thesaurus that has been automatically extracted from our parallel corpora. Our initial focus has been on data structures for aligning our parallel corpora and on the most appropriate algorithms for our use.
2. Work Progress Overview
2.1 Specific objectives for the reporting period
We have had three specific objectives for this reporting period.
1. To continue working on the document architecture for the lexicon, with a particular focus on transformations that will be appropriate for both the print edition and integration into the digital library system.
2. To continue developing a mechanism that will allow for better integration of pre-existing expert knowledge into the word profile tool, with a particular emphasis on mapping the citation scheme of poetic works in older reference works such as the Liddell-Scott-Jones Greek-English Lexicon to the more current and widely used standards established by the Thesaurus Linguae Graecae.
3. Preliminary work for D2.3: Tool to Extract Corpus Based Thesauri from Corpora.
2.2 Achievements
Document Architecture:
The architecture of the Greek lexicon needed to have a design which is suitable for both the print edition and the digitized version. The development of a dedicated document structure or Document Type Definition (DTD) was described in the Progress Report for February-March 2003. However, the individual documents also need to be linked into a unified system which allows for a wide variety of textual interrogation, and they also require suitable XSL transformations for display in the print and digital versions. Document linking is described briefly first.
Linking documents:
We initially contemplated the 'XLink' system, which uses a single file which contains all the links for the entire lexicon, in the form:
<interlink>
  <fromPoint href="Filename.xml#ID_OfSomeElementOrAnchor" />
  <toPoint href="OtherFilename.xml#ID_OfThingToLinkTo" />
  <go />
</interlink>
However, we decided that it was possible to link documents in a more straightforward structure, through direct HREF linking between the documents, with all external linking achieved with the 'headword' of an entry as the target. This created a much simpler architecture.
Each document may also contain a number of document-internal links, using attributes to the elements <RefFm>, <RefVL>, and <Form>. Of these, <Form> points to the HL within a single entry. <RefVL> and <RefFm> occur only within cross-reference entries, and refer to a variant or other form of the headword, within whichever entry is the target of the <Ref> element.
The <Ref> element always refers to a headword, and its attribute carries the only HREF link which can point to a target headword external to the document. Every document in the lexicon carries a unique identifier, and the headwords that appear in it carry a 'name' attribute (applied during production). We anticipate that this structure, in conjunction with the finely-structured DTD, will support a wide variety of textual interrogation.
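The linking scheme described above can be illustrated with a minimal Python sketch. It is not the project's software. The element names Ref and HL and the 'name' attribute are taken from the description above; the attribute name href, the "Filename.xml#name" target form, the entry wrapper, the file names, and the sample headwords are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature lexicon: file name -> XML text.
# Only Ref, HL and the 'name' attribute come from the report;
# 'href', <entry>, <lexicon> and the sample data are assumed.
DOCUMENTS = {
    "alpha.xml": '<lexicon><entry><HL name="agathos">agathos</HL></entry></lexicon>',
    "beta.xml": (
        '<lexicon><entry><HL name="beltion">beltion</HL>'
        '<Ref href="alpha.xml#agathos"/></entry></lexicon>'
    ),
}


def resolve_ref(href, documents):
    """Return the headword element targeted by 'File.xml#name', or None."""
    filename, _, name = href.partition("#")
    source = documents.get(filename)
    if source is None:
        return None
    for headword in ET.fromstring(source).iter("HL"):
        if headword.get("name") == name:
            return headword
    return None


def dangling_refs(documents):
    """List (file, href) pairs whose target headword cannot be found."""
    missing = []
    for filename, source in documents.items():
        for ref in ET.fromstring(source).iter("Ref"):
            href = ref.get("href", "")
            if resolve_ref(href, documents) is None:
                missing.append((filename, href))
    return missing
```

Because every external link targets a headword, a check such as dangling_refs(DOCUMENTS) only needs to look at one element type in the target file, which is the simplification that direct HREF linking offers over a separate link file.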
XSL transformations
It is desirable to have a high-quality display, both for feedback during the authoring process and for reader use, since the lexicon is so densely formatted, using Greek, italic, bold and bracketed text. We especially wish to avoid the almost unreadable texts of some earlier classical-language dictionaries. We are therefore using XSL-FO transformations, which are capable of generating print-quality output, with precise determination of text detail, as well as whitespace, indents, and other aspects of the overall appearance of the document. The transformations are still in the process of development. The output so far achieved is exemplified by the PDF file included here as an appendix (see Annex 2). An extract from the coding is given here:
<!-- ***** fo:page-sequence mode ***** -->
<!-- Creates fo:page-sequence elements -->
<xsl:template match="lex:lexicon" mode="fo:page-sequence">
  <fo:page-sequence master-reference="lexicon-page-sequence">
    <xsl:apply-templates select="." mode="fo:title" />
    <xsl:apply-templates select="." mode="fo:static-content" />
    <xsl:apply-templates select="." mode="fo:flow" />
  </fo:page-sequence>
</xsl:template>
<!-- ***** fo:title mode ***** -->
<!-- Creates a fo:title element -->
<xsl:template match="lex:lexicon" mode="fo:title">
  <fo:title>
    <xsl:value-of select="lex:header/lex:file/lex:title" />
  </fo:title>
</xsl:template>
<!-- ***** fo:static-content mode ***** -->
<!-- Creates fo:static-content elements -->
<xsl:template match="lex:lexicon" mode="fo:static-content">
  <fo:static-content flow-name="running-head-recto">
    <xsl:apply-templates select="lex:header" mode="fo:static-content" />
  </fo:static-content>
  <fo:static-content flow-name="running-head-verso">
    <xsl:apply-templates select="lex:header" mode="fo:static-content">
      <xsl:with-param name="side" select="'left'" />
    </xsl:apply-templates>
  </fo:static-content>
  <!--
  <fo:static-content flow-name="footer">
    <fo:block>Footer</fo:block>
  </fo:static-content>
  -->
</xsl:template>
<xsl:template match="lex:header" mode="fo:static-content">
  <xsl:param name="side" select="'right'" />
  <fo:block xsl:use-attribute-sets="lex:normal-font lex:italic-font"
            font-size="9pt" text-align="{$side}"
            space-before="{$fo:region-before-extent} - 11pt"
            border-bottom="0.5pt solid black">
    <xsl:value-of select="lex:file/lex:title" />
    <xsl:text> </xsl:text>
    <xsl:value-of select="lex:file/lex:date" />
  </fo:block>
</xsl:template>
<!-- ***** fo:flow mode ***** -->
<!-- Creates a fo:flow element -->
<xsl:template match="lex:lexicon" mode="fo:flow">
  <fo:flow flow-name="xsl-region-body"
           font-size="8.5pt" line-height="10pt"
           text-align="start">
    <xsl:apply-templates select="lex:text" mode="fo:block" />
  </fo:flow>
</xsl:template>
<!--
<xsl:template
    match="lex:AdvUsg | lex:Alt | lex:Ann | lex:Au |
           lex:Case | lex:Cllc | lex:Cmpl | lex:Ctxt |
           lex:Def | lex:Deg | lex:DInfl | lex:DL |
           lex:Ed | lex:Encyc | lex:Envr | lex:Ety | lex:Extra |
           lex:Form | lex:Func |
           lex:GLbl | lex:Gntv | lex:Gr |
           lex:HL |
           lex:Indic | lex:Infl | lex:ital |
           lex:Lbl | lex:LblR |
           lex:Md |
           lex:Obj |
           lex:Prnth | lex:PrPhr | lex:PrpUsg | lex:Prsd | lex:PS |
           lex:QualN |
           lex:RefFm |
           lex:Spec | lex:Subj | lex:Summ |
           lex:title | lex:Tns |
           lex:Tr | lex:TrPhr |
           lex:Usg |
           lex:Vc | lex:VInfl | lex:VL |
           lex:Wk |
           lex:XR" mode="fo:inline">
  <xsl:text> </xsl:text>
  <xsl:apply-imports />
</xsl:template>
<xsl:template match="lex:Lbl" mode="fo:inline">
  <xsl:text> </xsl:text>
  <xsl:apply-imports />
  <xsl:text> </xsl:text>
</xsl:template>
-->
<!--
<xsl:template match="*[not(preceding-sibling::node())] | lex:hyph | lex:Hm" mode="fo:inline">
  <xsl:apply-imports />
</xsl:template>
<xsl:template match="*" mode="fo:inline">
  <xsl:text> </xsl:text>
  <xsl:apply-imports />
</xsl:template>
-->
(See also Annex 2.)
Integration of Expert Knowledge:
The development of the word profile tool has faced two major interrelated problems in the integration of primary textual data. The first is that the corpus is not static: new textual information is continually being discovered, especially in the Oxyrhynchus papyri, which have been published regularly since 1898, with approximately another 40 volumes due to appear. The second is that Ancient Greek poetic texts have been edited using multiple citation systems, many of which were devised in the nineteenth century.
A two-part (binary) search procedure was designed to address both problems. For the first, the Perseus morphological analyzer can search throughout all relevant textual databases in the DLS, including newly-digitized texts as they become available. This will be particularly useful for Hellenistic (post-classical) Greek texts, where many important new discoveries are being made.
The second problem, of multiple citation systems, is especially severe for early lyric poets such as Sappho, whose works are preserved mostly in fragmentary state. Therefore, as well as the equivalence tables described in the Progress Report for June-November 2002, we have also built tables for the poets, which will be integrated in the search software. The morphological analyzer can then conduct separate searches which are restricted to passages cited in reference works such as the Liddell and Scott Greek Lexicon (LSJ), and match the old citations to the digitized texts. Outputs from the two types of searches can then be used for scholarly research, in tandem or separately.
The equivalence tables will also have a more general reference use for classical literary and linguistic studies, as they will enable readers of LSJ and other reference works to identify passages in the modern editions. They will therefore also be published in print form.
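As an illustration of how such an equivalence table might be consulted by software, the following is a minimal Python sketch. The rows are copied from the Annex 1 extract; reading the two right-hand figures of each row as a fragment number within the PMG section and a continuous PMG number is an assumption, as are the function name and the output format.

```python
# Rows copied from the Annex 1 extract.
# Key: LSJ fragment number. Value: the two PMG figures given in the table
# (assumed here to be the section-internal and the continuous PMG number).
EQUIVALENCES = {
    "Carm.Pop.": {"1": (3, 849), "2": (34, 880), "4": (26, 872), "13": (1, 847)},
    "Corinn.": {"1": (5, 658), "2": (9, 662)},
}


def map_citation(citation):
    """Map an LSJ citation such as 'Carm.Pop. 13' to a PMG reference string.

    Returns None when the author or fragment is not in the table, so that
    unmapped citations can be reported rather than silently guessed.
    """
    author, _, number = citation.rpartition(" ")
    row = EQUIVALENCES.get(author, {}).get(number)
    if row is None:
        return None
    fragment, continuous = row
    return f"{author} {fragment} (PMG {continuous})"
```

With these rows, map_citation("Carm.Pop. 13") returns "Carm.Pop. 1 (PMG 847)". Entries that map to other collections (for example the rows pointing to IEG or Athenaeus) would need a richer value type than the pair of integers used in this sketch.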
An extract from the introduction to the human-readable version follows. See also a sample from the table, included as Annex 1.
[Extract from Introduction begins]
When using the Greek-English Lexicon of Liddell-Scott-Jones (LSJ), readers face the problem that many citations of the early Greek poets are to editions which are out of print and have been superseded by more recent works which give different numbers to the fragments. Although their comparationes numerorum provide helpful 'back bearings' to the earlier editions, they do not constitute a fast method of linking from citations in LSJ to the texts. In addition, users of the Thesaurus linguae Graecae (TLG) CD-ROM may have no access to them, and citations in Montanari and the DGE cannot always be matched to LSJ. The authors and works covered are summarized below, grouped approximately by genre.
Lyric and iambic poets
Mappings are given for Alcaeus, Alcman, Anacreon, Archilochus, Bacchylides, Callimachus (Aet., Epigr., Hec., Iambi, fragments), Carmina popularia, Corinna, Hipponax, Ibycus, Ion, Lyrica adespota (in Page PMG listed as Fragmenta adespota), Philoxenus, Pindar (Paeanes, Parthenia, Dithyrambi), Praxilla, Sappho, Scolia (Carmina convivalia in PMG), Simonides, Stesichorus, Timocreon, and Timotheus.
Epigrams
Epigrams by lyric and iambic poets are included in their listings. Citations from the Anthology (AP, APl., and App.Anth.) retain the same numbering in most modern editions, apart from the collections of Gow & Page, whose indexes are cited.
Bucolic and elegiac poets
Poets are not included if their early numbering is retained in modern editions. These authors include: Callinus, Demodocus, Mimnermus, Moschus, Pratinas, Semonides, Solon, Theocritus, and Tyrtaeus. However, the fragments of Bion are mapped, and the division of Theognis into Books 1 and 2 is given.
Epic fragments
Citations of epic fragments in old editions are mostly from Allen, and sometimes from Kinkel. Mappings from both are given for Cypria, Epigoni, Il.Parv., Il.Pers., Nosti, and Titanomachia. For Hesiodic fragments, readers are directed to the concordance in Merkelbach-West.
Comic fragments
While most fragments of Aristophanes and Menander have the same numbering in old and new editions, much new material has been discovered, and fragments have been extensively renumbered. References are given to Kassel-Austin's PCG III.2 (for Aristophanes) and VI.2 (Menander). For Menander, mappings are given for line numbers of named plays, and for fragments which appear in Sandbach.
Philosophical fragments
Old editions cite from Diels Vorsokr. or PPF. As the same numbering is retained in Diels & Kranz and KRS, it is not given here. The editions are cited in the bibliography.
Tragic fragments
Citations of Aeschylus and Sophocles have the same numbering in most editions, so these are cited, and mappings are given to Diggle TGFS. Mappings are also given for citations of Aeschylus from Weir Smyth AJP, and, for Euripides, from Arnim to Page Select Papyri, Bond, and Diggle Phaeth.
[Extract from Introduction ends. See also Annex 1].
Preliminary Work for D2.3: Tool to Extract Corpus Based Thesauri from Corpora
Work on Deliverable 2.3, a tool to extract a corpus-based thesaurus from our parallel corpus of Greek and Latin texts, focused on two areas. First, we looked at document architecture to allow for more precise alignment of texts. The Perseus text display system can display parallel segments of Greek and Latin texts, but the granularity is very coarse. The only map points available are the ones defined by the <div> or <milestone> tags and declared in the <refsdecl> tag of the TEI header. While this mechanism is appropriate for works such as Greek rhetoric, where the standard citation unit is usually no more than a paragraph or two, it is less appropriate for poetry and drama, where milestones might be 200 or more lines apart and the div structure might present entire scenes from a play.
Therefore, we have developed a system for automatic text alignment that takes advantage of a facility in the Perseus text display system that allows us to get a precise citation, including a line number, for the beginning of any particular sentence. For example, a display chunk of book 1 of Homer's Iliad in our system will offer parallel translations of lines 33 to 65, but it is possible to use the byte-offset in the XML file to discover that the sentence ennêmar men ana straton ôicheto kêla theoio begins at line 53. Our approach for texts structured with line numbers like this one is therefore to get the citation information for every sentence in both the Greek and English versions of the text, round the line number down to the nearest 10, and then use that citation as an alignment point for chunks of text. This data is stored in a SQL database with the following structure:
 Attribute | Type    | Modifier
-----------+---------+----------
 sennum    | integer |
 docid     | text    |
 tail      | text    |
 lang      | text    |
 senid     | text    |
 senlen    | text    |
 toplevel  | text    |
 cit       | text    |
 dcit      | text    |
Here dcit is the rounded citation for each sentence. We then select all of the sentences from the Greek and English versions of the text with the same dcit value and use the resulting sentences as the basis to calculate possible translation equivalents.
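The rounding and selection steps can be sketched in Python with an in-memory SQLite table. This is only an illustration under stated assumptions: the table keeps a subset of the columns listed above, the table name sentences, the 'book.line' citation format, the language codes, and the sample rows are all hypothetical, and the project's own database and queries may differ.

```python
import sqlite3


def rounded_citation(cit):
    """Round the line number of a 'book.line' citation down to the nearest 10."""
    book, line = cit.rsplit(".", 1)
    return f"{book}.{int(line) // 10 * 10}"


def build_database(sentences):
    """Load (docid, lang, senid, cit) tuples into an in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sentences (sennum INTEGER, docid TEXT, lang TEXT, "
        "senid TEXT, cit TEXT, dcit TEXT)"
    )
    for sennum, (docid, lang, senid, cit) in enumerate(sentences, start=1):
        conn.execute(
            "INSERT INTO sentences VALUES (?, ?, ?, ?, ?, ?)",
            (sennum, docid, lang, senid, cit, rounded_citation(cit)),
        )
    return conn


def aligned_pairs(conn):
    """Pair Greek and English sentence ids that share a rounded citation."""
    query = (
        "SELECT g.dcit, g.senid, e.senid FROM sentences AS g "
        "JOIN sentences AS e ON g.dcit = e.dcit "
        "WHERE g.lang = 'greek' AND e.lang = 'en' "
        "ORDER BY g.sennum, e.sennum"
    )
    return conn.execute(query).fetchall()


# Hypothetical sample: two Greek and two English sentences from book 1.
SAMPLE = [
    ("il.gr", "greek", "g1", "1.53"),
    ("il.gr", "greek", "g2", "1.62"),
    ("il.en", "en", "e1", "1.51"),
    ("il.en", "en", "e2", "1.64"),
]
```

With this sample, the Greek sentence beginning at line 53 and the English sentence beginning at line 51 both receive the rounded citation 1.50 and are therefore grouped into the same alignment chunk.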
The second portion of this work has focused on evaluating approaches that will be successful for texts written in Greek, Latin, and Old Norse. We have focused our investigations on three different measures: a chi-squared test, a t-score, and a mutual information score. In our initial investigations, the chi-squared test appears the most promising. Mutual information scores are highly sensitive to variation in words that occur with relatively low frequencies. Similarly, t-scores assume that the probabilities of words occurring together are normally distributed, and Zipf's law shows that this assumption does not hold. At this point, work is proceeding with the chi-squared test as we develop the multi-lingual thesaurus tool.
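For reference, the three measures can be computed from a 2x2 contingency table of aligned chunks, as in the Python sketch below. The formulas are the standard textbook forms (the t-score uses the common approximation that takes the observed count as the variance); the report does not state which variants were used, so the function name, argument names, and exact formulas here are assumptions.

```python
import math


def association_scores(both, source_only, target_only, neither):
    """Score a candidate translation pair from chunk co-occurrence counts.

    both         chunks containing the source word and the target word
    source_only  chunks containing only the source word
    target_only  chunks containing only the target word
    neither      chunks containing neither word
    Requires both > 0 and non-zero row and column totals.
    """
    a, b, c, d = both, source_only, target_only, neither
    n = a + b + c + d
    # Co-occurrences expected if the two words were independent.
    expected = (a + b) * (a + c) / n
    chi_squared = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    mutual_information = math.log2(a / expected)
    t_score = (a - expected) / math.sqrt(a)
    return {
        "chi_squared": chi_squared,
        "mutual_information": mutual_information,
        "t_score": t_score,
    }
```

Comparing association_scores(10, 0, 0, 90) with association_scores(1, 0, 0, 99) shows the sensitivity mentioned above: the pair seen together only once receives the higher mutual information score, even though it rests on far less evidence.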
2.2.2 Progress of Workpackage/Tasks
We are on track to deliver D2.3 on time.
2.3.2 Work planned for next reporting period
Continued work on the multi-lingual information retrieval tool and document architecture issues. Completion of the citation scheme map and its integration into the word study tool.
3.1 Co-operation within the consortium, including project meetings
Project meeting in Cambridge between the three project members JRC, BLF and AAT, 9-10 June, 2003.
Consortium liaison meeting in London, with representatives of all participating institutions, 12 June, 2003.
3.2 Participation in workshops, conferences, publications
PUBLICATIONS:
"Automatic Disambiguation of Latin Abbreviations in Early Modern Texts for Humanities Digital Libraries" in Proceedings of the 2003 Joint Conference on Digital Libraries
"Towards a Cultural Heritage Digital Library" (with members of the Perseus Project) in Proceedings of the 2003 Joint Conference on Digital Libraries
CONFERENCES:
Joint Conference on Digital Libraries, Houston, Texas, May 28 - June 2, 2003.
Annex 1: Equivalence Table for poetic texts (extract):
Carm.Pop. = CARMINA POPULARIA TLG 0295, 001
(Bergk III pp.654-88 to PMG pp.449-470; GL V pp.232-269.)
LSJ     PMG
1       3     849
2       34    880
3       PMG, Fr.adesp. 37 = 955
4       26    872
5       33    879
6       25    871
7-8     5     851
9       31    877
10      16    862
11      33    879
12      14    860
13      1     847
14      17    863
15      20    866
16      19    865
17      18    864
18      24    870
19      6     852
20      30    876
21      30    876
22A     30    876
22B     15    861
23      22    868
24      4     850
25      35    881
26      13    859
27      7     853
28      IEG II p.11, Adesp.eleg. 17
        (TLG 0234, 001 Elegiaca adespota)
29      Ath. 10.455D (83, 2-3)
30      Tryphon p.193, 18
31      Ath. 10.453B (78, 22)
32      Ath. 10.453B (78, 23)
33      Ath. 10.455D (83, 8)
34      IEG II p.93, Panarces (a)
35      Plu. Quom.adul. 54 B 6
36-38   Ath. 14.648F (60, 10-20)
39      28    874
40      IEG II p.8, Adesp.eleg. 7
        (TLG 0234, 001 Elegiaca adespota)
41      2     848
42      36    882
43      23    869
44      27    873
45      21    867
46-47   Coll.Alex. pp.173, 138
Corinn. = CORINNA Lyr. TLG 0294, 001
(Bergk III pp.543-53 to PMG pp.325-45; GL IV pp.18-69.)
LSJ citations marked "Corinn.Supp." are given separately.
LSJ     PMG
1       5     658
2       9     662
3       20    673
4       10    663
5       8     661
6       6     659
7       3