The Perseus OAI data provider

Federated Resource Discovery and Linking

Overview of Deliverables

In WP3, the Perseus project has developed the following components for citation and meta-data sharing to date:

A general data-provider routine, usable by any co-operating system, and an XML schema for the metadata fields specific to this infrastructure (beyond the basic Dublin Core fields) (March 2003)
A data harvester usable by any co-operating system (June 2003)
A report on Naming Conventions for DL Objects (December 2003)
Maintenance Procedures for Naming Conventions (December 2003)

These tools will ultimately lead to a federated system where digital library systems can receive hyperlinks created at run-time in another system, allowing for federated resource discovery and linking across systems.

The Perseus OAI Data Provider

Perseus has implemented an OAI data provider that works with the meta-data formats of the Perseus text system. A version of the code is included in the standard distribution of the text system, as cgi-bin/pdataprov. The code as distributed is specific to Perseus, however; see the site-specific modifications below.

Supported schemata

Pdataprov supports the OAI DC schema, the OLAC schema, and a Perseus schema, currently under development.

The OAI Dublin Core meta-data schema is standard for all OAI data providers. Because the Perseus text system uses the standard Dublin Core field names, the conversion is straightforward: titles are titles, creators are creators, and so on.
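The one-to-one conversion described above can be illustrated with a minimal sketch. It is written in Python for illustration and is not the pdataprov code; the function name to_oai_dc and the shape of the input row (a dictionary keyed by Dublin Core field name) are assumptions. A complete oai_dc container would also carry an xsi:schemaLocation attribute, omitted here for brevity.

```python
from xml.sax.saxutils import escape

# Namespaces used by the standard oai_dc metadata format.
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"

# The fifteen unqualified Dublin Core elements.
DC_FIELDS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def to_oai_dc(row):
    """Render a metadata row as an oai_dc XML fragment.

    `row` is a hypothetical stand-in for one ptext record: a dict mapping
    field names to a string or a list of strings. Fields that are not
    Dublin Core elements are ignored.
    """
    lines = [f'<oai_dc:dc xmlns:oai_dc="{OAI_DC_NS}" xmlns:dc="{DC_NS}">']
    for field in DC_FIELDS:
        values = row.get(field, [])
        if isinstance(values, str):
            values = [values]
        for value in values:
            # Same name on both sides: a title is a title, and so on.
            lines.append(f"  <dc:{field}>{escape(value)}</dc:{field}>")
    lines.append("</oai_dc:dc>")
    return "\n".join(lines)
```

Because the field names match, no lookup table is needed; anything outside the Dublin Core set (for example an internal status field) is simply left out of this format.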

The OLAC schema is specifically designed for language and linguistic data. Perseus, as one of the early members of OLAC, supports and endorses this schema.

The Perseus schema is defined by http://www.perseus.tufts.edu/persmeta.xsd. It is intended to support the more complex meta-data elements that the Perseus system uses internally, which will permit greater interaction between installations of this text system. Currently the schema includes three groups of fields, all drawn from the ptext table: the status, funder, correction level, and method elements, which give some information about how the text came to exist; the citation scheme, which is crucial for mapping references and commentary lemmata onto the text; and the layout fields, stylesheet and template, by which the collection editor controls the display of the text.

Ultimately, the Perseus meta-data schema will be able to include fields from other tables, notably cits (citations of other texts), dates (dates mentioned in the text), and refs (place names mentioned in the text).

Site-specific modifications

To use pdataprov with an installation of the Perseus system outside Perseus itself, make the following changes:

Change $repository_id to this repository's OAI identifier.
Change references to Perseus:collection and Perseus:text to use your own system prefix.
Edit the output of the Identify verb so that it describes your repository.
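One way to keep these site-specific values in a single place is sketched below. It is an illustration in Python, not part of pdataprov: the class SiteConfig, its field names, and the way identifiers are composed are all assumptions, and the example values (example.org, doc-001) are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteConfig:
    """Hypothetical container for the values that differ between sites."""
    repository_id: str    # plays the role of $repository_id
    system_prefix: str    # replaces "Perseus" in Perseus:collection / Perseus:text
    repository_name: str  # reported by the Identify verb
    admin_email: str      # reported by the Identify verb

    def text_id(self, local_id):
        # Assumed pattern, modelled on the Perseus:text prefix.
        return f"{self.system_prefix}:text:{local_id}"

    def collection_id(self, name):
        # Assumed pattern, modelled on the Perseus:collection prefix.
        return f"{self.system_prefix}:collection:{name}"

    def identify_fields(self):
        # Only two of the elements an Identify response carries are shown.
        return {
            "repositoryName": self.repository_name,
            "adminEmail": self.admin_email,
        }


config = SiteConfig(
    repository_id="example.org",
    system_prefix="Example",
    repository_name="Example Text Collection",
    admin_email="admin@example.org",
)
```

With the values collected this way, moving the provider to another installation means editing one object rather than searching the script for every occurrence of the Perseus prefix.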

Data Provider Implementation details

The main structure of pdataprov is a switch on the verb. Each of the six standard OAI verbs - Identify, ListMetadataFormats, ListSets, ListIdentifiers, GetRecord, and ListRecords - is handled in its own block, which verifies the arguments and then creates the XML output. Global variables $status, $stat_msg, and $error_code hold error information.

Figure 1: pdataprov responding to ListMetadataFormats verb
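The verb switch and its argument checks can be sketched as follows. This is a Python illustration of the structure described above, not the pdataprov source; the table VERBS and the function check_request are hypothetical names. The verb names, argument names, and the badVerb and badArgument error codes come from the OAI-PMH 2.0 protocol, and the returned pair plays the role of $error_code and $stat_msg.

```python
# For each verb: (required arguments, optional arguments).
VERBS = {
    "Identify": (set(), set()),
    "ListMetadataFormats": (set(), {"identifier"}),
    "ListSets": (set(), {"resumptionToken"}),
    "ListIdentifiers": ({"metadataPrefix"},
                        {"from", "until", "set", "resumptionToken"}),
    "ListRecords": ({"metadataPrefix"},
                    {"from", "until", "set", "resumptionToken"}),
    "GetRecord": ({"identifier", "metadataPrefix"}, set()),
}


def check_request(params):
    """Validate request parameters.

    Returns (error_code, message); error_code is None when the request is
    well formed and the verb's own block may go on to build its XML output.
    """
    args = dict(params)
    verb = args.pop("verb", None)
    if verb not in VERBS:
        return "badVerb", f"Illegal or missing verb: {verb!r}"

    required, optional = VERBS[verb]

    # A resumptionToken is an exclusive argument: it replaces all others.
    if "resumptionToken" in args and "resumptionToken" in optional:
        if len(args) > 1:
            return "badArgument", "resumptionToken must be the only argument"
        return None, "ok"

    unknown = set(args) - required - optional
    if unknown:
        return "badArgument", f"Illegal argument(s): {sorted(unknown)}"
    missing = required - set(args)
    if missing:
        return "badArgument", f"Missing argument(s): {sorted(missing)}"
    return None, "ok"
```

A table-driven check keeps the per-verb blocks short: each block only has to deal with errors particular to its own data, such as an unknown identifier or an unsupported metadata format.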

The Perseus OAI Data Harvester

Because pdataprov supports the current version of the protocols (OAI 2.0 and OLAC 0.4), any conforming harvester can retrieve the data it supplies.

Perseus uses the Perl Harvest module written by the Digital Libraries Research Group at Virginia Tech, available from the OAI Tools page. We then post-process the harvested records to turn them into the form required by the ptext table, and load them into that table. Because ptext uses the standard Dublin Core field names, this conversion is straightforward for the basic Dublin Core schema. We expect to do the same with the enhanced Perseus schema. Once the records are in the ptext table, the text system can use them in the same ways it would use meta-data for local documents.
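The post-processing step can be illustrated with a short sketch. It is written in Python and does not use or reproduce the Perl Harvest module; the function names list_records_url and records_to_rows, and the row layout (a dictionary with an oai_identifier key plus one list of values per Dublin Core field), are assumptions rather than the actual ptext column layout.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespaces defined by OAI-PMH 2.0 and the oai_dc metadata format.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build a ListRecords request URL for a provider's base URL."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def records_to_rows(xml_text):
    """Turn a ListRecords response into a list of flat rows."""
    root = ET.fromstring(xml_text)
    rows = []
    for record in root.iterfind(".//oai:record", NS):
        header = record.find("oai:header", NS)
        # Deleted records carry a header but no metadata; skip them.
        if header is None or header.get("status") == "deleted":
            continue
        row = {"oai_identifier": header.findtext("oai:identifier", namespaces=NS)}
        dc = record.find("oai:metadata/oai_dc:dc", NS)
        if dc is not None:
            for element in dc:
                # "{namespace}title" -> "title": the field name is kept as-is.
                field = element.tag.split("}", 1)[-1]
                row.setdefault(field, []).append((element.text or "").strip())
        rows.append(row)
    return rows
```

A real harvest would also follow resumptionToken elements to fetch the remaining pages of a large result; that loop is left out here to keep the conversion itself visible.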

Data Harvester Implementation details


For an individual record, meta-data fields are gathered with the standard subroutine get_doc_info (in Ptext::Info), then re-written in XML form. This happens in the routine format_record, which is called by the blocks for ListIdentifiers, ListRecords, and GetRecord. This routine is therefore the only part of pdataprov that concerns itself with the details of the schemata.
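The benefit of confining schema details to one routine can be shown with a small sketch. It is a Python illustration, not the Perl format_record itself: the signature, the FORMATTERS registry, and the info dictionary (standing in for whatever get_doc_info returns) are assumptions. The header and record element names and the cannotDisseminateFormat error code are those of OAI-PMH 2.0.

```python
from xml.sax.saxutils import escape


def format_dc(info):
    """Render the gathered fields as a bare oai_dc body (namespaces omitted)."""
    parts = [f"<dc:{name}>{escape(value)}</dc:{name}>"
             for name, value in info.items()]
    return "<oai_dc:dc>" + "".join(parts) + "</oai_dc:dc>"


# One entry per supported metadataPrefix; other schemata would be added here.
FORMATTERS = {"oai_dc": format_dc}


def format_record(identifier, datestamp, info, metadata_prefix, header_only=False):
    """Format one record for ListIdentifiers, ListRecords, or GetRecord."""
    header = (f"<header><identifier>{escape(identifier)}</identifier>"
              f"<datestamp>{escape(datestamp)}</datestamp></header>")
    if header_only:
        # ListIdentifiers returns headers only, so no schema is consulted.
        return header
    try:
        formatter = FORMATTERS[metadata_prefix]
    except KeyError:
        raise ValueError("cannotDisseminateFormat") from None
    return f"<record>{header}<metadata>{formatter(info)}</metadata></record>"
```

Under this arrangement, supporting a further schema means adding one formatter and one registry entry; the verb blocks that call format_record do not change.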

Data Harvester Implementation Examples

The data harvester is integrated with the Perseus search tool (cgi-bin/vor) in the digital library system. This tool allows for simultaneous searching of internal and external metadata.

Figure 2: vor search for 'Sophocles' showing results from both the Perseus Digital Library and External Sites

The Perseus interface also has a simple prototype for link federation: when a user views a result from an external site, key terms in that page are linked, at run-time, to implicit searches in the Perseus Digital Library.

Figure 3: Page discovered by Perseus OAI search tool displayed with automatically generated links back to resources in the Perseus Digital Library
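The idea of generating search links at run-time can be sketched as follows. This is a Python illustration under stated assumptions, not the Perseus prototype: the function link_key_terms is hypothetical, the list of key terms is supplied by the caller, and search_base is a placeholder URL prefix rather than the real query syntax of the vor tool.

```python
import html
import re
from urllib.parse import quote_plus


def link_key_terms(text, key_terms, search_base):
    """Wrap each known key term in `text` in a link to a search for it.

    The text is HTML-escaped; matching is case-sensitive and on whole words.
    """
    if not key_terms:
        return html.escape(text)
    # Longest terms first, so a multi-word term wins over its first word.
    ordered = sorted(key_terms, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in ordered) + r")\b")

    out = []
    pos = 0
    for match in pattern.finditer(text):
        out.append(html.escape(text[pos:match.start()]))
        term = match.group(1)
        href = html.escape(search_base + quote_plus(term))
        out.append(f'<a href="{href}">{html.escape(term)}</a>')
        pos = match.end()
    out.append(html.escape(text[pos:]))
    return "".join(out)
```

Because the links are searches rather than fixed addresses, the external page needs no knowledge of how the target library names its documents; the link resolves to whatever the library holds for that term when the user follows it.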

Next: Naming Conventions and Maintenance Procedures