Cultural Heritage Language Technologies

Workpackage 1

 

Advanced Digital Library Applications

 

Stefan Rueger

Daniel Heesch

Dolores Iorizzo

 

Year 3 Executive Summary

 

            In the third year of CHLT we have completed and refined the development of our text documentation and clustering tool 'Vishnu', which integrates tools for (i) full-text document indexing, (ii) collection building, (iii) keyword extraction, (iv) document search and retrieval, (v) keyword-based document clustering and (vi) a suite of interactive user interfaces. Vishnu seeks to discover structure within the set of documents retrieved for a query and to expose this structure to the user in order to facilitate document search; it then interacts with the user for resource discovery and discrete searching across corpora. The software is particularly well suited to non-specialists working in specialist domains, that is, users who have only a partial knowledge of the original language.  We have developed Vishnu for use with corpora provided by the CHLT consortium in the original languages of Old Norse, Neo-Latin and Ancient Greek.  These texts are typically studied by users who are not specialists in these languages but who nevertheless have a desire to understand the texts.  Vishnu allows them to search these corpora in innovative ways and gives them a tool that begins to bridge the gap between passive awareness and active understanding.

 

Computing keywords at runtime without any pre-processing is computationally expensive. We have solved this problem by building a set of candidate keywords from the entire pool of words based on statistical criteria and then computing keywords at run-time from this reduced set.  In Vishnu keywords serve two purposes: (a) they are primarily used for document representation and document clustering, yet (b) they are also useful in their own right, providing the user with a core vocabulary of salient terms associated with particular topics, as well as potentially interesting co-occurrences of salient terms both within and between documents. A user who has only a limited grasp of a domain can thus acquire a more thorough understanding of a topic and consolidate the passive knowledge they may possess by exploring the set of returned keywords within the documents themselves.  For users who are well acquainted with the subject, the keywords offer an overview of the collection which allows them to test hypotheses about the occurrence of keywords in texts, and also allows a refinement of understanding about how exactly the meaning of a keyword functions in a particular text.

 

Vishnu offers three visualizations. Two of these use the result of the keyword-based clustering of retrieved documents and invite the user to browse the document set hierarchically.  A third visualization uses keywords directly to determine the position of a document on the display: keywords are arranged in a circle around the retrieved documents, which are attached by springs to the keywords in the set.  The more salient a keyword is for a document, the stronger the force pulling the document towards that keyword.  The three interfaces are complementary and their relative appeal will differ between users.  This set of alternative interaction models lets users switch between different visualization environments for different purposes, enabling new kinds of research.

 

 

Our final efforts focused on six areas:

 

 

1) Evaluation

 

We have used the set of three different interfaces to carry out a user study
investigating the relationship between user satisfaction and the aesthetic appeal of the interface. The study has recently been submitted to HCI 2005. A paper on formal aspects of the algorithms involved and a quantitative evaluation of cluster performance has been submitted to D-Lib.

 

2) Consolidation

 

Consolidation work involved the development of scripts for software installation
and automated collection building; this helped us to bring to fruition our full programming and project integration within the consortium.

 

 

3) Integration

 

As part of our ongoing integration effort, we have ensured the successful installation
of Vishnu throughout the consortium. It has been installed at all sites and has dedicated servers at Tufts University, where it is integrated into the CHLT/Perseus Digital Library System, at the University of Missouri, Kansas City, which is also a CHLT anchor site, and at Imperial College London, where it runs in the Department of Computing and the Newton Project, Centre for the History of Science. The software and documentation are available online at http://faya.doc.ic.ac.uk:8800/chlt/vishnu.html.

 

(Note: Vishnu runs in Mozilla Firefox: http://www.mozilla.org/products/firefox/)

 

 

 

4) Dissemination

 

The development and refinement of Vishnu has led to a number of publications and conference papers in year three: DMS 2004, JCDL 2004 and ECDL 2004, as well as the following: 

 

S Rueger, "Putting the user in the loop: visual resource discovery",  Proc 3rd

International Workshop on Adaptive Multimedia Retrieval (AMR, Jul 2005,

Glasgow), Springer LNCS, 2005

 

   http://mmir.doc.ic.ac.uk/www-pub/amr-2005.pdf

  

 

D Heesch and S Rüger: Query-based keyword extraction and document clustering

    for information retrieval and knowledge consolidation (in preparation for D-Lib)

 

S Rueger: Multimedia Information Systems, Fakulti Sains Komputer & Teknologi

Maklumat, Universiti Putra Malaysia, 18 Jul 2005

 

S Rueger, Multimedia Systems, Laboratoire GILCO, INP Grenoble, 7 Mar 2005

 

S Rueger, Multimedia Systems, Laboratoire ICAR, Institut de la Communication, Univ
Lumière Lyon 2, 10 May 2005

 

 S Rueger: What's a falcon got to do with baseball? New Paradigms in Multimedia

Digital Libraries, Univ of the Arts, Central St Martin, London, 27 Jan 2005

 

S Rueger: New Paradigms in Multimedia Management and Search, BT, Adastral Park,

Ipswich, 22 Sept 2004

 

S Rueger and D Iorizzo: Visualizations for Cultural Heritage Language Technologies,
ECDL, Bath, 2004

 

 S Rueger: New Paradigms in Multimedia Management and Search, Institution of

Electrical Engineers, Kingston-upon-Thames, 7 Sept 2004

 

S Rueger: New Paradigms in Multimedia Management and Search, University of
Bamberg, Germany, Computer Science Seminars, 10 Aug 2004

 

 

5) Open Source Technology

 

CHLT has an open source policy, in line with the Berlin Declaration; the technical resources used in WP1 are specified below:

 

MG:

 

MG is an open source full-text retrieval system that makes efficient use of disk resources by storing the index, and the text that is indexed, in compressed form. The MG system is
described in the eponymous book Managing Gigabytes. MG can be downloaded from: http://puka.cs.waikato.ac.nz/html/mg.html

 

 

 

 

Apache Lucene

 

Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java.  It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.  Apache Lucene is an open source project available for free download from Apache Jakarta: http://lucene.apache.org/java/docs/

 

 

Tomcat

 

Tomcat is a servlet container that is used in the official reference implementation for the

Java Servlet and JavaServer Pages technologies. The Java Servlet and JavaServer Pages

specifications are developed by Sun under the Java Community Process. Tomcat is developed in an open source and participatory environment and released under the Apache Software License: http://jakarta.apache.org/tomcat/

 

LiquidLnF

 

LiquidLnF is a Java Swing Look and Feel archive used for enhancing the style of the graphical user interfaces: https://sourceforge.net/projects/liquidlnf

 

 

6) Final Report on Cluster Visualization Tool

 

Final Report on Cluster Visualization Tool

 

Stefan Rueger, Daniel Heesch and Dolores Iorizzo, Imperial College London

Department of Computing & Centre for the History of Science

May 2005

 

1. Introduction

 

This report presents a final summary of the work carried out at Imperial College London during the period June 2004 to May 2005. The principal objectives were the completion of work on a set of tools for the clustering and visualization of text documents and the application of these tools to CHLT texts, including the Newton, Old Norse, Ancient Greek and Neo-Latin collections. We offer these tools as part of an integrated system that performs full-text indexing, extraction and indexing of keywords, full-text search, and clustering and visualization of returned document sets.

 

All parts of the system are based on open-source software and are themselves subject to the GNU General Public License. The system can be used either as a stand-alone application, with collections being accessed locally or remotely, or as a web application. The server side uses servlet technology, which has been part of Java since version 1.3 and provides high-level abstractions and support for a variety of services, such as dynamic HTML and socket communication; servlets are particularly powerful in conjunction with client-side applets. While we sought to implement as much of the code as possible in platform-independent Java, external software such as MG, which we employ as one of two alternative search engines, is written almost entirely in C, as is, for efficiency reasons, the server-side code for the clustering module.

 

2. Indexing

 

We have developed a standard Java API to provide a uniform way of accessing the functionality of the indexing programs used for collection construction. Our present system involves two sets of indices. The first set consists of the indices required for full-text search; at present we provide this indexing functionality through either MG, a widely used open source package largely written in the C programming language, or Lucene, a set of Java tools developed as part of the Apache Jakarta Project. The second set of indices relates to the frequency and other statistics of candidate keywords within all documents of a collection. The full-text and candidate keyword indexers are functionally quite distinct, but core commonalities exist and are contained in an Indexer super class.

Indexer super class

 

package Indexer;

 

abstract public class Indexer

{

    public static final int DEFAULT = 0;

    public static final int DIP = 1;

    public static final int REG = 2;

    public static final int LEM = 3;

    public static final int SYN = 4;

 

    public static final String[] types = new String[]{"full","dip","reg","lem","syn"};

 

    int type;

    String name;

    String outputDir = null;

    String inputDir = null;

    String limiter = null;

    String collName = null;

    String VIS_HOME = null;

   

    public Indexer(String n)

    {

                  name = n;                   

    }

 

    public void setRootDirectory(String root)

    {

                  VIS_HOME = root;

    }

 

    public void setIndexingType(int t)

    {

                  type = t;

    }   

   

    public void setOutputDirectory(String s)

    {

                  outputDir = s;

    }

   

    public void setInputDirectory(String s)

    {

        inputDir = s;

    }

   

    public void setLimiter(String s)

    {

        limiter = s;

    }

   

    public void setCollectionName(String c)

    {

                  collName = c;

    }

 

    abstract public void startIndexing();

}

 

In the following we briefly describe the two full-text search engines we currently deploy before presenting the rationale and methodology of our candidate keyword indexer.

 

Full-text indexing

 

We currently deploy two alternative open source tools for full-text indexing and search, one is MG, the other Lucene. In order to use other full text search engines, a wrapper needs to be written that agrees with our Indexer API.

MG indexing

MG is an open-source indexing and retrieval system for text, images, and textual images, and is covered by a GNU public licence. The current version of MG is available via FTP from http://www.cs.mu.oz.au/mg/. The MG (Managing Gigabytes) system is a collection of programs which comprise a full-text retrieval system: it allows one to create a database out of a given set of documents and then run queries against it to retrieve any relevant documents. It is "full-text" in the sense that every word in the text is indexed and the query operates only on this index to do the searching. A first introduction to MG can be found at http://www.mds.rmit.edu.au/mg/intro/mgintro.html, and a more detailed account is provided in the eponymous book “Managing Gigabytes”.

 

For indexing, MG’s mgbuild script expects to run an application or a shell script that delivers a series of documents to it via standard output. A basic sample script is provided and it is up to the user to devise a suitable script if the basic functionality proves inappropriate. We have written a C program mg_get.c which fulfills this role. It reads in a file (docs.lst) which contains the paths of all the documents to be indexed, reads them one by one and passes the content to mgbuild.

We have written a Java wrapper for MG that uses the Java process API to call the mgbuild script. Information about the docs.lst file and the folder in which indices are to be stored are passed on as arguments to mgbuild.
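
For illustration, the following minimal sketch shows how such a wrapper might drive mgbuild through the standard Java Process API. The class name is hypothetical and the argument handling is simplified; only the -d and -i flags, which are described in the appendix notes, are taken from this report.

import java.io.BufferedReader;
import java.io.InputStreamReader;

// Minimal sketch: invoke MG's mgbuild script as an external process.
// The real MGWrapper passes its arguments through the Indexer API and may differ in detail.
public class MGBuildRunner
{
    public static int runMgBuild(String mgBinDir, String indexDir,
                                 String docsDir, String collName) throws Exception
    {
        String[] cmd = new String[] {
            mgBinDir + "/mgbuild",
            "-d", indexDir,     // where the indices are to be stored
            "-i", docsDir,      // where mg_get finds the documents (docs.lst)
            collName            // collection name
        };
        Process p = Runtime.getRuntime().exec(cmd);

        // Echo mgbuild's output so that indexing progress is visible to the caller.
        BufferedReader out = new BufferedReader(new InputStreamReader(p.getInputStream()));
        String line;
        while ((line = out.readLine()) != null)
            System.out.println(line);

        return p.waitFor();     // exit status of mgbuild
    }
}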

 

Lucene indexing

 

Jakarta Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. It is part of the Apache Jakarta Project, whose principal goal is to provide commercial-quality server solutions, based on the Java Platform, developed in an open and cooperative fashion.

 

The indexing is taken over by the IndexWriter class, which is instantiated with two arguments: the name of the directory for the indices and an instance of org.apache.lucene.analysis.Analyzer. Analyzer is a super class that provides the core functionality common to various document processing subclasses; the StopAnalyzer, for example, tokenizes the text, converts all strings to lowercase and filters out stop words from the index. At present Lucene only provides a full suite of analyzers for English and German, but the extension to other languages is under way. As for MG, we have written a Lucene wrapper which harnesses the Lucene indexing functionality and which also inherits from the abstract Indexer class.
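
The following minimal sketch illustrates this indexing flow using the Lucene API current at the time of writing (IndexWriter, Field.Text). The class and field names are illustrative; the real LuceneWrapper additionally inherits from the abstract Indexer class and reads its document list from the collection's import directory.

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

// Minimal sketch of Lucene full-text indexing (Lucene 1.4-era API).
public class LuceneIndexSketch
{
    public static void index(String indexDir, File[] docs) throws Exception
    {
        IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), true);

        for (int i = 0; i < docs.length; i++)
        {
            StringBuffer text = new StringBuffer();
            BufferedReader in = new BufferedReader(new FileReader(docs[i]));
            String line;
            while ((line = in.readLine()) != null)
                text.append(line).append('\n');
            in.close();

            Document doc = new Document();
            doc.add(Field.Keyword("path", docs[i].getPath())); // stored, not tokenized
            doc.add(Field.Text("contents", text.toString()));  // tokenized and indexed
            writer.addDocument(doc);
        }

        writer.optimize();
        writer.close();
    }
}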

 

Candidate Keyword Indexing

 

The natural features of text documents are words or phrases, and a document collection can contain millions of such distinct features. The often-used word histogram representation of documents consequently leads to very high dimensional vectors. The intrinsic problem with such representations is that any two randomly picked vectors in a high-dimensional hypercube tend to have a constant distance from each other, no matter what the measure is. As a consequence, clustering algorithms that are based on document distances become unreliable.

 

To give a numeric example of the effect of high dimensionality, let x, y ∈ [0,1]^n be drawn independently from a uniform distribution. The expected value of their sum-norm distance is n/3, with a standard deviation of √(n/18). For n = 1,800 (corresponding to a joint vocabulary of just 1,800 words for a word histogram representation) this means a typical distance of |x-y|1 = 600 ± 10. With increasing n, the ratio between standard deviation and vector size gets ever smaller, as it scales with n^(-1/2). This is a generic statistical property of high-dimensional spaces with any standard distance measure, and can be traced back to the law of large numbers. Although word histogram document representations are by no means random vectors, each additional dimension tends not only to spread the size of a cluster but also to dilute the distance between two previously well-separated clusters. Hence, it seems prohibitive to involve all semantic features (e.g., the words) of a document collection in document clustering. Even after applying feature reduction, the number of features remains large. In early experiments with 548,948 documents (TREC CDs vol3 and vol4, including articles of the Los Angeles Times, the Financial Times, the Federal Register, Congress Records and the Foreign Broadcast Information Service, see http://trec.nist.gov), a candidate keyword had to appear in at least three documents and in no more than 33% of all documents. This resulted in a vocabulary of around 222,872 candidate keywords. In our system we store, at index time, around 200 candidate keywords per document. A set R of documents returned by a query may still have a candidate-keyword vocabulary of well over 10,000 different words. As an example, the four sets of top 500 documents returned by the queries "mafia", "Clinton", "cancer" and "computer" use a vocabulary of between 14,000 and 17,000 different candidate keywords.
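
The concentration effect can be checked numerically. The following self-contained snippet, which is not part of Vishnu, estimates the mean and standard deviation of the sum-norm distance for n = 1,800 and reproduces the 600 ± 10 figure quoted above.

import java.util.Random;

// Numerical check: for x, y drawn uniformly from [0,1]^n, the L1 distance has
// mean n/3 and standard deviation close to sqrt(n/18), i.e. 600 +/- 10 for n = 1800.
public class DistanceConcentration
{
    public static void main(String[] args)
    {
        int n = 1800, trials = 10000;
        Random rnd = new Random(42);
        double sum = 0, sumSq = 0;

        for (int t = 0; t < trials; t++)
        {
            double d = 0;
            for (int i = 0; i < n; i++)
                d += Math.abs(rnd.nextDouble() - rnd.nextDouble());
            sum += d;
            sumSq += d * d;
        }
        double mean = sum / trials;
        double std = Math.sqrt(sumSq / trials - mean * mean);
        System.out.println("mean L1 distance: " + mean + "  (n/3 = " + (n / 3.0) + ")");
        System.out.println("std deviation:    " + std + "  (sqrt(n/18) = " + Math.sqrt(n / 18.0) + ")");
    }
}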

 

The CK wrapper provides a Java environment from within which the C programs responsible for the extraction and indexing of candidate keywords can be started. The location of the source documents and the folder for the indices can be set through public methods defined in the Indexer super class. This information is passed on as arguments to a shell script, ck_index.sh, which in turn calls the backend C program, the functional core of the module.
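
A minimal sketch of such a wrapper, reusing the Indexer super class shown earlier, is given below; the class name is hypothetical and error handling is simplified.

package Indexer;

// Sketch of the candidate keyword (CK) wrapper: it reuses the fields of the
// Indexer super class and passes the document and index locations to the
// ck_index.sh shell script in $VIS_HOME/bin.
public class CKIndexerSketch extends Indexer
{
    public CKIndexerSketch(String name)
    {
        super(name);
    }

    public void startIndexing()
    {
        try
        {
            String script = VIS_HOME + "/bin/ck_index.sh";
            String[] cmd = new String[] { script, inputDir, outputDir };
            Process p = Runtime.getRuntime().exec(cmd);
            p.waitFor();
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
    }
}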

 

Java GUI for indexing

 

To aid the construction of new collections and to demonstrate the use of the API of the aforementioned Java wrappers, we have written a small application that provides a graphical user interface to all the basic functionality hitherto only available through the command line. As can be seen in the screenshot of the GUI in Figure 1, the essential information is the location of the documents to be indexed and the paths to the places where the MG and CK indices are to be kept. Additional information such as the name of the collection is not critical but is useful for subsequently informing the user about the content of particular collections.

 

 

Figure 1: GUI for collection building

3. Full Text Search

 

Full-text search is handled by either Lucene or MG, depending on which software was used to create the indices; the system determines the indexer to be used by reading the appropriate field in the collection’s config.xml file. For both MG and Lucene we have written search wrappers that implement a basic searcher interface, as shown below.

Searcher Interface

 

package server.Search;

 

import java.util.*;

 

public interface SearchInterface

{

    public void search(String query, String collection, int view);

 

    public Vector getDocIdentifier();

 

    public Vector getDocDescriptions();

 

    public Vector getDocContent();

}

 

MG Search

 

MG’s search engine can be run in a number of modes. We currently ask MG to perform ranked query searches and to return, for each relevant document, its index and a brief section of the text. The query program already has the functionality to extract short samples of text; we modified this method so that it skips as many non-alphabetical characters as are found at the beginning of the document. An extra command directive was also added to the query program so that the full content of relevant documents can be retrieved.

Lucene Search

 

Full document search is achieved through the collaboration of an IndexSearcher, an Analyzer and a QueryParser. The query parser is constructed with an analyzer so that the query is interpreted in the same way the index was previously built. The QueryParser produces a Query object, which is passed to the searcher, and the search results are returned as a collection of Documents. The Lucene wrapper is again given in the technical appendix. Note that it implements the Searcher interface, which specifies the core methods required by our application (e.g. getDocContent(), search() etc.).
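
The following minimal sketch illustrates this search flow with the Hits/QueryParser API of the Lucene versions current at the time of writing. The class and field names are illustrative; the real wrapper implements the SearchInterface shown above and resolves the index directory from the collection's config.xml.

import java.util.Vector;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

// Minimal sketch of Lucene full-text search (Lucene 1.4-era API).
public class LuceneSearchSketch
{
    public static Vector search(String indexDir, String queryString) throws Exception
    {
        IndexSearcher searcher = new IndexSearcher(indexDir);
        QueryParser parser = new QueryParser("contents", new StandardAnalyzer());
        Query query = parser.parse(queryString);   // interpret the query with the same analyzer

        Hits hits = searcher.search(query);
        Vector results = new Vector();
        for (int i = 0; i < hits.length(); i++)
        {
            Document doc = hits.doc(i);
            results.add(doc.get("path"));          // document identifier stored at index time
        }
        searcher.close();
        return results;
    }
}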

 

4. Hierarchical Collection Structure

 

We have adopted what is at the same time a more systematic and a more dynamic way of associating the visualization with collections. According to the new model, which has in a similar way been adopted by GSDL (Greenstone Digital Library), collections are grouped to form sites. At each level of the hierarchy we can specify properties and services that hold for all the collections contained in the corresponding subtree. At start-up, the system recursively scans the sites directory, initializes the collections found and presents them to the user.
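
The start-up scan can be sketched as follows; the class name is hypothetical, and the only assumption taken from this report is that each collection directory contains a config.xml file.

import java.io.File;
import java.util.Vector;

// Sketch of the start-up scan: the sites directory is walked recursively and
// every subdirectory containing a config.xml is registered as a collection.
public class SiteScanner
{
    public static Vector findCollections(File sitesDir)
    {
        Vector collections = new Vector();
        scan(sitesDir, collections);
        return collections;
    }

    private static void scan(File dir, Vector collections)
    {
        File[] entries = dir.listFiles();
        if (entries == null)
            return;
        for (int i = 0; i < entries.length; i++)
        {
            if (entries[i].isDirectory())
            {
                if (new File(entries[i], "config.xml").exists())
                    collections.add(entries[i]);   // a collection root
                else
                    scan(entries[i], collections); // a site grouping further collections
            }
        }
    }
}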

 

At the lowest level of the hierarchy we allow each collection to consist of a number of “views”, each with its own set of indices. The default view is the set of unprocessed documents; derived views, or surrogate documents, are the result of operations such as stemming, stopping, or lemmatization of individual words. Thus, generating views amounts to the formation of equivalence classes based on morphological, semantic or functional similarity. Applying a stop list to a collection, for example, can be construed as a mapping of all commonly occurring words to an empty class, and a mapping of all other words to themselves.

 

This structural extension has implications for how collection building and search are carried out. In the process of collection building, documents are first imported from a specified directory, with the documents for each view placed in their own subdirectories. Each such subdirectory is assumed to contain a file with a list of all documents to be indexed for that particular view. The indexer creates a set of parallel indices in the collection's index directory, with each subdirectory named by the first three letters of the operation that characterizes the view (def=default, ste=stemming, lem=lemmatization, syn=synonyms etc.). Document processing is not currently part of the integrated indexing software, i.e. the indexer assumes the documents to be pre-processed; collection building thus consists, at the highest level, of the two steps of document processing and document indexing.

 

At present the user specifies the desired view as part of the query string, in a similar fashion to advanced query formulation in the input field of conventional text search engines like Google. For example, "\l" indicates that the query is to be lemmatized and the search carried out in the lemmatized documents. Because collections are likely to differ in the number and types of views they offer, it might well be justified to use an additional graphical component to select the view. When no view is specified, the system assumes the default view (i.e. the query remains unprocessed and indices are retrieved from the index/def directory). While both the full-text indices and the candidate keyword indices are view-specific, the actual document to be retrieved is still the original, unprocessed document.

 

5. Clustering

 

Our clustering is based on the joint document/keyword distribution. While the documents are available from the full-text search engine, the keywords still need to be determined. Because we have already narrowed down the set of potential keywords to our set of candidate keywords at index time, our task is relatively straightforward and will be detailed below.

Keyword computation

 

Processing the results is done in real time and has to be fairly quick, so we have taken some care to use algorithms that are efficient and as close to O(n log n) as possible. We use trees as data structures because they allow fast insertion and location of items in sorted data sets, and they lend themselves to recursive algorithms since their structure is itself recursive. In using recursion with the trees we have removed instances of tail recursion.

When the query program returns a result, it does so in the form of a document number and a short piece of the document text, or snippet (see Appendix A), for each document in the search result set. The ‘result processing’ program takes the document number of each document and loads the matching ordered vector of candidate keyword indices and populations. It reads these from the variable length record file compiled in section 4.2. The snippet is stored in an array for later use.

A tree is built containing the index of each word in the result set, the search set document frequency and the archive set document frequency. This tree is ordered by archive set document frequency, and as the vectors are also ordered the same way, the iteration filling the tree with the contents of each vector is shuffled so as to generate a shallower tree than would otherwise be the case: the median values are entered into the tree before the extreme values. The deeper the tree, the more the time taken to build it approaches O(nm), where n is the number of distinct candidate keywords in the search set and m is the (large) number of references to potentially interesting words in the document vectors. An optimal tree should take O(m log n) time to build, log n being the depth of the tree.

Once the tree is built, the search set document frequency for each candidate keyword is known. The tree is then iterated to calculate the ‘interesting weight’ for each candidate keyword, using the method explained in section 2.1. A new tree ordered by this ‘interesting weight’ is built by pulling leaves from the old tree and inserting them into the new tree, thus destroying the old tree. The top 100 elements of this new tree are used to build an array of keyword indices; the rest of the candidate keywords are discarded. This array is used to build a translation table that maps the array position of each keyword to its index value. The table is filled in order of decreasing keyword weight and then sorted with quicksort in order of increasing keyword index value.

The next task is to build a matrix of the incidence of keywords in documents, where the documents represent the rows and the keywords the columns. As the vast majority of entries in the matrix will be zero, we opt for a sparse matrix design: the matrix consists of an array of linked lists representing its rows, and each list element records the column of the matrix, the value of the matrix node and a reference to the next element (or null). The elements within a row need not be in any particular order, but a row must not contain duplicate references to the same column. To construct this sparse matrix we iterate through each of the vectors containing the indices of candidate keywords for the documents in the search result. As the indices in both the translation table and the vectors are ordered by increasing keyword index value, the two lists of indices can be compared efficiently without having to check each element in one against each element in the other; for example, if the current index in the translation table is higher than the current index in the vector, we can skip to the next index in the vector. Whenever the indices match, an entry is added to the sparse matrix row for that document. In this way a sparse matrix containing the keyword-versus-document information for the whole search result set is quickly built up.

At this point those documents that contain none of the key words are eliminated from the matrix, as none of the visualizations can say anything about them. In the vast majority of cases, documents that are eliminated are those that have been ranked by the search engine below the 400th place in the result set.
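
The sparse matrix design described above can be sketched as follows; the class is illustrative and omits the iteration and merging logic.

// Sketch of the sparse keyword-versus-document matrix: an array of linked
// lists, one per document (row), whose elements record the keyword column,
// the count, and a link to the next element.
public class SparseMatrix
{
    static class Node
    {
        int column;     // keyword index (position in the translation table)
        int value;      // number of occurrences of the keyword in the document
        Node next;

        Node(int column, int value, Node next)
        {
            this.column = column;
            this.value = value;
            this.next = next;
        }
    }

    private final Node[] rows;          // one linked list per document

    public SparseMatrix(int numDocuments)
    {
        rows = new Node[numDocuments];
    }

    // Prepend an entry; callers must not add the same column twice for a row.
    public void add(int row, int column, int value)
    {
        rows[row] = new Node(column, value, rows[row]);
    }

    public int get(int row, int column)
    {
        for (Node n = rows[row]; n != null; n = n.next)
            if (n.column == column)
                return n.value;
        return 0;                        // absent entries are zero
    }
}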

 

Document Representation

 

Document clustering has attracted much interest in recent decades, and much is known about the importance of feature reduction in general, and for clustering in particular. However, little has been done so far to facilitate feature reduction for document clustering of query results. In contrast to the conventional approach of using a tf-idf weighting scheme, we suggest ranking the importance of each such candidate keyword j with a weight

wj = (rj/dj) · (rj/|R|)(1 − rj/|R|),

where |R| is the total number of documents in the set R of returned documents, rj is the number of documents in R containing word j, and dj is the number of documents in the whole document collection D containing j. The second factor prefers words with medium-valued matched-document frequencies, while the first factor prefers words that specifically occur in the matched documents. The highest-ranked words are meant to be related to the query. Indeed, we obtain "software", "IBM", "UNIX" etc. as the top-ranked words when querying for "computer". This provides a powerful way to restrict the features of the matched documents to the top k ranked words, which we will call the keywords. One important aspect is that the features are computed at query time. Hence, when the above query is refined to "computer hardware", a completely new set of keywords emerges automatically.

 

For each matched document i we create a k-dimensional vector vi, whose j-th component vij is a function of the number of occurrences tij of the j-th ranked keyword in document i. This function is a variation of the tf-idf weight that gives less stress to the term frequency tij. We project the vector vi onto the k-dimensional unit sphere, obtaining a normalized vector ui that represents document i. We deem the scalar product of ua and ub (i.e. the cosine of the angle between vectors va and vb) a sensible semantic similarity measure between two documents a and b in the document subset R returned by a query with respect to the complete document collection D.

 

u may be viewed as a document representation matrix where the row vector ui is a k-dimensional representation of document i and uij is viewed as the importance of keyword j for document i. In particular, uij = 0 if and only if document i does not contain keyword j. The number of features k can be controlled by the experimenter, and our experiments using the TREC data of human relevance judgments have shown that k=10 yields superior clustering results. Note that even if only the top ten keywords are used for the clustering and document representation, we might still display more keywords on the screen to assist the user in his or her search.
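
The representation and similarity computation can be sketched as follows. The weight function w() merely stands in for the dampened tf-idf variant used by the system; log(1 + t) is chosen here purely for illustration.

// Sketch of the document representation: each document becomes a k-dimensional
// vector over the top-k keywords, is projected onto the unit sphere, and pairs
// of documents are compared by their scalar product (cosine similarity).
public class DocumentVectors
{
    // counts[i][j]: occurrences t_ij of the j-th ranked keyword in document i
    public static double[][] represent(int[][] counts)
    {
        double[][] u = new double[counts.length][];
        for (int i = 0; i < counts.length; i++)
        {
            double[] v = new double[counts[i].length];
            double norm = 0;
            for (int j = 0; j < v.length; j++)
            {
                v[j] = w(counts[i][j]);
                norm += v[j] * v[j];
            }
            norm = Math.sqrt(norm);
            if (norm > 0)
                for (int j = 0; j < v.length; j++)
                    v[j] /= norm;        // project onto the unit sphere
            u[i] = v;
        }
        return u;
    }

    private static double w(int t)
    {
        return t > 0 ? Math.log(1 + t) : 0;   // illustrative dampening only
    }

    // cosine similarity of two normalized document vectors
    public static double similarity(double[] ua, double[] ub)
    {
        double dot = 0;
        for (int j = 0; j < ua.length; j++)
            dot += ua[j] * ub[j];
        return dot;
    }
}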

 

Buckshot clustering

 

Post-retrieval document clustering has been well studied. We deploy a variant of the Buckshot algorithm: each cluster contains a certain number of document vectors and is represented by their normalized arithmetic mean, the so-called centroid vector. In the first phase, hierarchical clustering with complete linkage operates on the best-ranked 150 documents. This can be done in a fraction of a second of CPU time. Hierarchical clustering has the advantage that one can either prescribe the number N of clusters or let the number of clusters be determined by demanding a certain minimum similarity within a cluster. Either way, once clusters within the top-ranked documents are identified, their N centroids can be computed and used as seeds for standard iterative clustering of the remaining documents. This algorithm scales linearly with the number |D| of documents and the number N of clusters. In our experience, one cycle of iterative clustering is not only adequate but also preserves the cluster structure given by the top-ranked documents; 1,000 documents can thus be clustered in well under one second on a 2 GHz PC.
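
The second, linear phase of this scheme can be sketched as follows; the hierarchical seeding of the centroids (phase one) is assumed to have been done already and is not shown.

// Sketch of the second phase of the Buckshot variant: seed centroids obtained
// from hierarchical clustering of the top-ranked documents are used for a
// single pass of iterative clustering over the normalized document vectors.
public class BuckshotPass
{
    // docs: normalized document vectors (non-empty); seeds: N centroid vectors from phase one.
    // Returns the cluster index assigned to each document.
    public static int[] assign(double[][] docs, double[][] seeds)
    {
        int[] cluster = new int[docs.length];
        double[][] sums = new double[seeds.length][docs[0].length];

        for (int i = 0; i < docs.length; i++)
        {
            int best = 0;
            double bestSim = -1;
            for (int c = 0; c < seeds.length; c++)
            {
                double sim = dot(docs[i], seeds[c]);   // cosine similarity
                if (sim > bestSim) { bestSim = sim; best = c; }
            }
            cluster[i] = best;
            for (int j = 0; j < docs[i].length; j++)
                sums[best][j] += docs[i][j];
        }

        // Recompute the centroids as normalized arithmetic means.
        for (int c = 0; c < seeds.length; c++)
        {
            double norm = Math.sqrt(dot(sums[c], sums[c]));
            if (norm > 0)
                for (int j = 0; j < sums[c].length; j++)
                    seeds[c][j] = sums[c][j] / norm;
        }
        return cluster;
    }

    private static double dot(double[] a, double[] b)
    {
        double d = 0;
        for (int j = 0; j < a.length; j++)
            d += a[j] * b[j];
        return d;
    }
}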

 

JNI

 

For reasons of efficiency, some of the critical server-side code has been written in the C programming language; these are the programs responsible for keyword computation and document clustering. To access this functionality we use the Java Native Interface (JNI), the native programming interface for Java and part of the JDK. By writing programs using the JNI, one can ensure that the code is completely portable across all platforms. The JNI allows Java code that runs within a Java Virtual Machine (VM) to interoperate with applications and libraries written in other languages, such as C, C++, and assembly. Programmers use the JNI when an application cannot be written entirely in the Java programming language: for example, when the standard Java class library does not support the platform-dependent features needed by the application, when a library or application already exists in another programming language, or when some small portion of time-critical code is best written in a lower-level language such as assembly and called from the Java application. The JNI framework lets native methods use Java objects in the same way that Java code uses them. A native method can create Java objects, including arrays and strings, and then inspect and use these objects to perform its tasks; it can also inspect and use objects created by Java application code, and it can even update Java objects that it created or that were passed to it, with these updated objects being available to the Java application. Thus both the native language side and the Java side of an application can create, update, and access Java objects and share them between them.
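
As an illustration, the Java side of such a native binding might look as follows; the method name and signature are hypothetical, and only the library name (libCluster.so) is taken from the appendix.

// Sketch of how the clustering code in libCluster.so could be reached from
// Java via JNI. The actual native interface of the clustering module may differ.
public class ClusterNative
{
    static
    {
        System.loadLibrary("Cluster");   // loads libCluster.so from LD_LIBRARY_PATH
    }

    // Implemented in C (procResults.c): computes keywords and clusters for the
    // documents returned by a query and reports the result as a string.
    public native String clusterResults(int[] documentIds, String collectionPath);
}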

 

6. Visualization

 

We have developed three visualizations of the set of returned documents. Two of these, the Sammon Map and the Dendro Map, make direct use of the clustering result. The application as a whole offers the possibility of browsing the same result set in several different ways simultaneously. The cluster-based visualizations give a broader overall picture of the result, while the Radial visualization allows the user to focus on subsets of keywords. Also, as the clusters are approximations that bring to the fore particular keywords, it may be useful to return to the Radial visualization and examine the effect of these keywords upon the whole document set. The Radial visualization will perhaps be most fruitful when the initial keywords match the user's area of interest. The Sammon Map lets the user dissect search sets and re-cluster subsets, gradually homing in on target sets; again the user may have recourse to the Radial visualization and apply those keywords to the whole result set. In the end, he or she will need to read only a small fraction of the retrieved documents. The cluster-based visualizations give a visual analogy of the structure implicit in a library classification scheme, whereas the Radial visualization exposes the effects of keywords that create cross-references between documents and allows the user to skim between subject areas.

 

Ranked List

 

The simplest of our visualizations is a conventional ranked list display of the results set. This is the style adopted by most web search engines such as Google or AltaVista. For each hit, we display the document number, which itself is a link to the original document, and a small extract of the document.

 

 

Figure 2: Conventional Ranked List display of result sets.

Sammon Map

 

This visualization uses a non-linear Sammon transformation to map the locations of document cluster centroids to points on a two-dimensional plane. The Sammon map is computed using an iterative gradient search as detailed in Sammon’s original paper.  The algorithm aims to represent n-dimensional points in a lower-dimensional space while attempting to preserve the pair-wise distances between the objects. The arrangement of the clusters on the display is such that their mutual distances are indicative of the relationships found in the original higher-dimensional space.
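
A simplified sketch of such an iteration is given below. It uses plain gradient descent with a fixed learning rate rather than the pseudo-Newton step of Sammon's original paper, and it assumes that all pairwise centroid distances are positive.

import java.util.Random;

// Simplified Sammon-style layout: the high-dimensional centroid distances
// dStar[i][j] are approximated by 2D distances via gradient descent on the stress.
public class SammonSketch
{
    public static double[][] map(double[][] dStar, int iterations, double rate)
    {
        int n = dStar.length;
        Random rnd = new Random(1);
        double[][] y = new double[n][2];
        for (int i = 0; i < n; i++)
            for (int k = 0; k < 2; k++)
                y[i][k] = rnd.nextDouble();          // random initial layout

        double c = 0;                                 // normalization constant
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                c += dStar[i][j];

        for (int it = 0; it < iterations; it++)
        {
            for (int i = 0; i < n; i++)
            {
                double gx = 0, gy = 0;
                for (int j = 0; j < n; j++)
                {
                    if (j == i) continue;
                    double dx = y[i][0] - y[j][0], dy = y[i][1] - y[j][1];
                    double d = Math.max(Math.sqrt(dx * dx + dy * dy), 1e-9);
                    double f = (dStar[i][j] - d) / (dStar[i][j] * d);
                    gx += f * dx;
                    gy += f * dy;
                }
                // move y[i] so as to reduce the stress
                y[i][0] += rate * (2.0 / c) * gx;
                y[i][1] += rate * (2.0 / c) * gy;
            }
        }
        return y;
    }
}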

In the central panel each cluster is represented by a circle and is labelled with its most frequent keyword. The radius of the circle indicates the size of the cluster. The distance between any two circles in the graphic panel is an indication of the similarity of their respective clusters: the nearer the clusters on the display, the more similar the documents contained within are likely to be.

The Sammon visualization allows the user to select clusters and thus to home in on smaller sets of documents. When the mouse passes over a cluster circle, a 'tool tip' box in the form of a pop-up menu appears. Operations such as "select" and "drill down" can be performed for any particular cluster or set of selected clusters.

 

 

Figure 3: Basic Sammon Map Visualization.

 

The first item in the cluster pop-up menu shows the number of documents contained in that cluster. Choosing this item displays a scrolled table of cluster keywords in the pane on the left-hand side of the visualization, and a scrolled list of cluster document hot-links and snippets appears in the scrolling text window at the bottom of the display.

The drill down item in the pop-up menu performs a redisplay of the documents in the set of selected clusters. Clustering is performed online on this new subset of documents and a new Sammon map is created. Unlike in the Dendro map, where the undrill functionality is associated with the current root node and is offered to the user when she moves over the central node, no such root node exists in the Sammon map, and the operational symmetry between the two visualizations breaks down at this point: upon drill down the level indication at the top of the display is incremented and the back button enabled. The select item in the pop-up menu marks a cluster as selected and signals this with a flag. The other menu items serve to label the cluster with four significant keywords and are not selectable. An example of a drill-down operation is presented in Figure 4, where the documents to be clustered are the union of the three clusters selected in Figure 3 (recital, lisztian, and recital).

 

On the right of the main visualization, we display a small extract of the document similar to that on the ranked list display. Clicking on the document identifier opens up a new frame displaying the full content of the document.

 

 

Figure 4: Drill Down on the union of three selected clusters.

 

At the new level, we may choose a new cluster. The new table of keywords on the left is thus associated with the cluster labelled “polonaise”. The four keywords displayed in the pop-up menu give the user a rough idea of what to expect.

The table on the left shows the list of most prominent keywords and the frequency of their occurrence. Each row includes a box field which can be selected. At the bottom of the table is a filter button that makes the scrolling text window display only the hot-links and snippets from documents that contain the selected keywords. This filtering functionality is placed in its own class and can thus readily be used in conjunction with other visualizations.

 

Radial Map

 

The Radial Map is similar to VIBE, Radviz and Lyberworld. Radviz places the keyword nodes in a circle as dimensional anchors and the document nodes in the middle, as if suspended by springs connecting them to the keyword nodes. The greater the keyword content, the stronger the effect of the springs on the location of the document. Hence, we make direct use of the representation matrix u without explicit clustering. Radial adds a level of user interaction to the metaphor introduced by RadViz and VIBE. Building on VIBE, Lyberworld takes a similar idea into three dimensions: vector addition is used to position the document nodes within a relevance sphere. The document's keyword content creates a vector between the centre axis of the sphere and the position of that keyword on the sphere's surface, and the radius of the sphere is defined by the range of possible vector lengths. In Lyberworld the relevance sphere can be rotated so that the 2D computer display gives more of a clue as to the 3D location of a document node relative to the keyword nodes, and the relative attractiveness of the keyword nodes can be enhanced to pull related documents towards them. Radial, staying with only two dimensions, dispenses with some of the perceptual complexity implicit in rendering a three-dimensional model on a two-dimensional screen.

 

Radial also uses the statistically collected keywords to differentiate the documents. Initially, the twelve most highly ranked keywords are displayed in a circle. Any documents in the search set that contain those keywords are placed using a similar vector sum within the circle. As the mouse passes over the document nodes, a bubble displays a descriptive piece of text from the document. The dimensions of the circle are more arbitrary than in Lyberworld and if the display were simply based on a flat sum of vectors it would be possible for the document nodes to be outside the circle. However we have constrained their positions by projecting the radial locations through the arc-tangent function, with the result that the documents cannot be moved outside of the circle.
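
The placement just described can be sketched as follows; the class name is hypothetical, and the factor 2/π in front of the arc-tangent is one possible choice for keeping the squashed radius below the display radius R.

// Sketch of the Radial placement: keyword anchors sit on a circle, each
// document is pulled towards the keywords it contains in proportion to the
// weights u[i][j], and the resulting radius is squashed through the
// arc-tangent so that documents always stay inside the circle.
public class RadialLayout
{
    // u[i][j]: importance of keyword j for document i; R: display radius.
    // Returns an (x, y) position for each document relative to the centre.
    public static double[][] place(double[][] u, double R)
    {
        int k = u[0].length;
        double[][] anchor = new double[k][2];
        for (int j = 0; j < k; j++)
        {
            double angle = 2 * Math.PI * j / k;       // keywords evenly spaced on the circle
            anchor[j][0] = Math.cos(angle);
            anchor[j][1] = Math.sin(angle);
        }

        double[][] pos = new double[u.length][2];
        for (int i = 0; i < u.length; i++)
        {
            double x = 0, y = 0;
            for (int j = 0; j < k; j++)
            {
                x += u[i][j] * anchor[j][0];          // weighted vector sum
                y += u[i][j] * anchor[j][1];
            }
            double r = Math.sqrt(x * x + y * y);
            if (r > 0)
            {
                double squashed = R * (2 / Math.PI) * Math.atan(r);  // always below R
                pos[i][0] = squashed * x / r;
                pos[i][1] = squashed * y / r;
            }
        }
        return pos;
    }
}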

 

 

 

Figure 5: Radial Map Visualization

 

 

Dimensionality reduction has the effect that the locations of document nodes become ambiguous: there may be many reasons for a node to have a particular position. To mitigate this ambiguity in Radial, the user can click on a document node, and the keyword nodes that affect the location of that document are highlighted. We believe that this is a novel and useful feature. Clues to the effects of different dimensions are also given when a keyword node is selected with the mouse: the document nodes that contain the keyword are highlighted. As in Lyberworld, the vector effect of a particular keyword node on the document set can be enhanced; in Radial this is achieved by grabbing the keyword node with the mouse and moving it away from the edge of the circle, whereupon all documents that contain this keyword follow the movement of the keyword. Manual clustering can be effected by placing several keyword nodes together. Alternatively, a button allows the displayed keyword arrangement to be clustered automatically using the columns of the matrix u.

 

The choice of keywords used in the display can be extended or reduced by clicking on two visible lists of words. Zoom buttons allow the degree of projection to be increased or reduced so as to distinguish between document nodes around the edges of the display or at the centre.

 

The Radial visualization is meant to be an interactive tool for structuring the document set according to one's own preferences by shifting keywords around in the display. The shortcoming of the Radial visualization is that it can only say something about the documents in the result set that contain the particular set of keywords, and increasing the number of keywords eventually makes the whole display very difficult to interpret. This is where the cluster-based visualizations prove their merits.

 

Dendro Map

 

The Dendro Map aims to achieve a synthesis between the binary tree clustering and space-filling, focus-and-context display techniques such as tree maps, hyperbolic browsers or fractal layouts. The rationale behind this hybrid approach lies in combining the virtues of both approaches (uniformity, effective space-utilisation, respectively) while trying to remain computationally inexpensive so as to be able to deal with large text corpora.

 

The Dendro Map provides a uniform view of hierarchically organized information. The nodes are binary and the tree radiates out from the root node, which is positioned at the centre of the display. The distances between successive nodes shrink with their distance from the root, and the Dendro Map thus mimics the effect of mapping a hyperbolic distortion of the tree onto the Euclidean disc. It differs from the hyperbolic approach in two fundamental respects. First, in the Dendro Map the tree is clipped at the fringe of the circular display after a fixed number of nodes; by contrast, a hyperbolic visualization relies on an invertible, and thus information-preserving, mapping of the entire tree hierarchy. In practice, however, the same limitation regarding the number of levels also applies to the hyperbolic display, as distances towards the fringe become too small to be resolved. Secondly, unlike the hyperbolic visualization, the Dendro Map displays context information only in one direction: parent nodes disappear from sight as lower-level nodes are placed into focus. While this may be considered a restriction for undirected browsing tasks, it facilitates directed search by maximizing the display space for the children of the current top-level node.

 

 

Figure 6: Plane-filling binary tree in the Dendro Map

 

Navigation through the tree is achieved by selecting any node of the current display as the new focus: the node is moved into the centre and new nodes appear at the fringe of the display disc. Hovering over a node brings up a text box that provides information about the node's subtree, such as the most frequent keywords and the number of descendants. Colours help to distinguish between terminal nodes (representing documents) and intermediate nodes (representing document clusters). Printed next to each lowest-level node (those at the edge of the display disc) is the most frequent keyword associated with the document or document cluster. This information provides essential cues for the user when performing directed searches through the hierarchy.

 

Document viewing

 

Clicking on the document identifier in the right panel of the Dendro and Sammon Map visualizations, or double-clicking on any of the document nodes in the Radial Map visualization, causes a new frame to open which displays the content of the selected document. An example of a document from the Old Norse collection is shown in Figure 7 (with the query word saman highlighted).

 

Figure 7: Document display using an Old Norse Document

 

7. Language Particulars

Ancient Greek:

Indexing

 

Since Aristophanes of Byzantium, Ancient Greek has used six different diacritics: the iota subscript, two types of breathing marks (rough and smooth), and three types of pitch-related accents (acute, grave and circumflex). While these facilitate the reading of Ancient Greek texts, they complicate indexing, since we want two words to be regarded as essentially the same if they differ only with respect to their accentuation. A robust and commonly employed method is to ignore accents altogether and to pre-process accordingly whatever standard has been used to encode the Greek characters. Perseus’ Ancient Greek texts are encoded using the beta-code standard, which necessitates the removal of any of the set of six characters {(,),=,\,/,|} before indexing can commence. A second problem is the extraordinary wealth of derived word forms in Ancient Greek (and to a somewhat lesser extent in Latin): upon declension and conjugation, words can undergo changes in any part, be it in the form of prefixes, suffixes or modifications of the stem itself. MG’s standard stemming algorithm (i.e. Porter’s algorithm) applied to Ancient Greek words would not achieve the desirable result of mapping derived word forms to the same stem. In the case of the Perseus document collections, however, each document comes in two guises: the first is the beta-encoded original document, the second is essentially the same except that each word has been replaced by its lexical form. This surrogate document provides an ideal source for indexing, and we use it both for full-text indexing and for candidate keyword indexing.
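
As an illustration, the accent-stripping step might look as follows; the class name is hypothetical, and only the set of six characters is taken from this report.

// Sketch of the beta-code pre-processing step: the six accent characters
// ( ) = \ / | are removed so that words differing only in accentuation are
// indexed identically.
public class BetaCode
{
    private static final String ACCENTS = "()=\\/|";

    public static String stripAccents(String betaText)
    {
        StringBuffer out = new StringBuffer(betaText.length());
        for (int i = 0; i < betaText.length(); i++)
        {
            char c = betaText.charAt(i);
            if (ACCENTS.indexOf(c) < 0)     // keep everything except the accent marks
                out.append(c);
        }
        return out.toString();
    }
}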

Rendering

 

The encoding used in the Perseus system is beta code, which maps each of the Greek characters and the six diacritics to ASCII characters. In order to exploit the Unicode capabilities of Java, the beta code needs to be transliterated into Unicode, and a script has been written to perform this transliteration. Unicode offers two possibilities for encoding accented characters: (i) one can encode the character and its associated accents separately (thus mirroring beta code more closely) using the two ranges 0x0370 to 0x03FF for non-accented Greek characters and 0x0300 to 0x036F for accents; the final character is then rendered by dynamic character composition. (ii) Alternatively, one can make use of the pre-combined character range in the Greek Extended block of the Unicode character set (0x1F00 to 0x1FFF). Unfortunately, the Java rendering engine does not display glyphs of dynamically composed characters very well, a known problem that is currently being addressed by IBM. Worse, the pre-combined characters do not form part of the standard fonts included in the JVM and must therefore be installed before accented glyphs can be viewed. Thus, while Java provides full Unicode support when it comes to text processing such as searching for and sorting words, its rendering capabilities do not currently live up to Java’s ambition of complete platform independence.

 

We overcome this limitation by employing an encoding similar to the dynamic character composition encoding, with individual accents, and combinations thereof, being encoded as distinct strings of 0s and 1s, where a 1 indicates that the accent associated with that position has been set. Thus 01100 signifies that the preceding character carries a smooth breathing and a grave, and the beta-encoded “protre/pesqai” would result in the following pseudo-Unicode encoding: 960 961 959 964 961 949 01000 960 949 963 952 945 953. All the numbers other than those corresponding to accents represent non-accented Greek characters that are properly rendered by Java. Whenever an accent-encoding string is encountered during the rendering procedure, the corresponding accents are drawn “manually”, centred on the character if it is lowercase and preceding it if it is uppercase.

 

 

Figure 8: Rendering of Ancient Greek

 

As described above each Greek and Latin text comes in two guises. The surrogate version consists of the lemmatized version of the constituent words. This is used for indexing and can be used in the visualization to give the user an idea of the lexical form of the word. In the small viewing application, moving over a Greek word displays the lexical form underneath it as shown below:

 

 

Figure 9: Displaying the lexical form of words as the mouse passes over the text

 

In the example shown, the original word is a verb in the present middle infinitive, and it is lemmatized to the first person singular of the active.

 

Old Norse

 

All the Old Norse texts are available in the Unicode format and can be indexed and rendered without the need for prior transliteration.

 

Appendix:

Installation manual:

The code for both stand-alone and web-application can be downloaded at

http://faya.doc.ic.ac.uk/chlt/vishnu

 

To install follow the steps below:

 

gunzip vishnu.tar.gz

cd VIS

tar -xf vis.tar

source setup.sh

install.sh

 

The setup.sh script sets the VIS_HOME environment variable, which points to the root directory of the visualization. This will be used by the full-text search engines to locate the indices, and by the clustering program (procResults.c compiled into the shared library libCluster.so for JNI access) to locate the candidate keyword indices. It also sets the LD_LIBRARY_PATH environment variable which contains the path to the shared libraries.

 

The following subdirectories exist, with the associated functionality:

Lib          jars for using servlet technology and the Lucene full-text indexing and search functionality

MG           the MG distribution

collections  the top directory for the various collections accessed by the system; each collection sits in its own subdirectory and contains indices, original and derived (surrogate) documents

C            the C source for the clustering and candidate keyword extraction code

docs         documentation files

bin          executables for candidate keyword indexing etc.

java         the bulk of the code (visualizations, GUI for collection building, Lucene and MG wrappers for indexing, document parsers etc.) and the server-side code for the stand-alone application

 

 

Notes:

 

To employ MG as a search engine, some of MG’s native programs need to be replaced by our modified versions. The files are command.c (which contains an additional directive for retrieving the full content of a document) and mgquery.c; they need to be added to the mg-1.2.1/src/txt directory before being copied to MG’s bin directory. The MG configuration file, .mgrc, and the shell script mgbuild have also been modified slightly and need to be replaced in the bin directory.

 

 

 

VisServlet is the principal servlet through which the applet accesses server-side services. Compilation requires servlet.jar to be in the classpath (this jar is found in Tomcat's lib directory). When the system is installed as a web application, the host name and port number are determined at runtime.

 

For collection building, the Java program CollectionConstructor in VIS_HOME/java/classes/Util can be used. It takes care of all the necessary steps involved in building a new collection, namely (i) making the import and index directories, (ii) importing the files, (iii) indexing the files using either Lucene or MG for full-text search, (iv) extracting candidate keywords for online clustering, and (v) writing the collection-specific config.xml. All necessary parameters are supplied on the command line.

 

The Java wrappers for Lucene and MG are in $VIS_HOME/java/classes/Indexer and take four arguments: collection name, document source, index destination, and view type (diplomatic, regular, stemmed, lemmatized etc.). The collection name is needed because mgbuild prepends each index file with the collection name and places them all in a directory with the collection's name. Note that the Lucene wrapper requires lucene.jar in the classpath for compilation and running.

 

An example of indexing documents in the directory /vol/coll/norse/import using Lucene or MG, where the indices are to be stored in /vol/elweb/VIS/collections/nordic/index with the view “lem” and the collection name “Old Norse”, is given below:

 

java -cp $CLASSPATH Indexer/LuceneWrapper -i /vol/coll/norse/import -c "Old Norse" -o /vol/elweb/VIS/collections/nordic/index -s lem

 

java Indexer/MGWrapper -i /vol/coll/norse/import/ -c "Old Norse" -o /vol/elweb/VIS/collections/nordic/index -s lem

 

MG indexing is achieved by running MG's mgbuild script. mgbuild has been modified to allow an additional argument (-i) which specifies the directory in which documents are found. This is passed on to mg_get for document acquisition. Using the parameter, we circumvent the need for environment variables. mg_get takes the document location as single argument. mgbuild decides where to place the indices based on the -d argument.

 

$VIS_HOME/java/classes contains a Java wrapper for the candidate keyword indexing program ck_index found in $VIS_HOME/bin. The wrapper takes two arguments which are passed on to ck_index, the first being the location of the data, the second the location of the indices to be stored.

 

The makefiles for the libraries may need to be customized to find jni.h and jni_md.h (normally in $JAVA_HOME/include and $JAVA_HOME/include/linux).