Cultural Heritage Language Technologies
Workpackage 1: Advanced Digital Library Applications
Stefan Rueger, Daniel Heesch, Dolores Iorizzo
Year 3 Executive Summary
In the third year of CHLT we have completed and refined the development of our document search and clustering tool 'Vishnu', which integrates tools for (i) full-text document indexing, (ii) collection building, (iii) keyword extraction, (iv) document search and retrieval, (v) keyword-based document clustering and (vi) a suite of interactive user interfaces. Vishnu seeks to discover structure within the set of documents retrieved for a query and exposes this structure to the user to facilitate document search; it then interacts with the user to support resource discovery and discrete searching across corpora. The software is particularly well suited to non-specialists working in specialist domains, that is, users who have only a partial knowledge of the original language. We have developed Vishnu for use with corpora provided by the CHLT consortium in the original languages of Old Norse, Neo-Latin and Ancient Greek. These texts are typically studied by users who are not specialists in classical languages but who nevertheless wish to understand them. Vishnu allows such users to search these corpora in innovative ways and gives them a tool that begins to bridge the gap between passive awareness and active understanding.
Computing keywords at runtime without any pre-processing is computationally expensive. We have solved this problem by building a set of candidate keywords from the entire pool of words based on statistical criteria and then computing keywords at run time from this reduced set. In Vishnu keywords serve two purposes: (a) they are primarily used for document representation and document clustering, yet (b) they can also be useful in their own right by providing a user with a core vocabulary of salient terms associated with particular topics, as well as potentially interesting co-occurrences of salient terms both within documents and between documents. A user who has only a limited grasp of a domain can thus acquire a more thorough understanding of a topic and consolidate the passive knowledge they may possess by exploring the set of returned keywords within the documents themselves. For users who are well acquainted with the subject, the keywords offer an overview of the collection which allows them to test hypotheses about the occurrences of keywords in texts, and also allows a refinement of understanding about how exactly the meaning of a keyword functions in a particular text.
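The index-time pre-filtering described above can be sketched as a simple document-frequency filter. The thresholds below (a word must appear in at least 3 documents and in no more than 33% of the collection) follow the figures quoted later in this report; the class and method names are illustrative, not the actual Vishnu code.

```java
import java.util.*;

// Sketch of index-time pre-filtering: from the full vocabulary we keep only
// candidate keywords whose document frequency lies between a lower bound
// (at least 3 documents) and an upper bound (at most a third of the
// collection). Class and method names are illustrative.
public class CandidateFilter {
    public static List<String> candidates(Map<String, Integer> docFreq,
                                          int numDocs) {
        int minDf = 3;
        int maxDf = numDocs / 3;               // no more than 33% of all docs
        List<String> result = new ArrayList<String>();
        for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
            int df = e.getValue();
            if (df >= minDf && df <= maxDf) {
                result.add(e.getKey());
            }
        }
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> df = new HashMap<String, Integer>();
        df.put("the", 950);      // too common: appears in 95% of documents
        df.put("newton", 40);    // kept
        df.put("fluxion", 12);   // kept
        df.put("hapax", 1);      // too rare: appears in a single document
        System.out.println(candidates(df, 1000));
    }
}
```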
Vishnu offers three visualizations: two of these use the result of the keyword-based clustering of retrieved documents and invite the user to hierarchically browse the document set. A third visualization uses keywords to directly determine the position of a document on the display. Keywords are arranged in a circle around a document cluster, and each document is attached by springs to all of the keywords in the set. The more salient a keyword is for a document, the stronger the force pulling the document towards that keyword. The three interfaces are complementary and their relative appeal will differ between users. By providing a set of alternative interaction models, we enable users to switch between different visualization environments for different purposes, opening up new kinds of research.
Our final efforts focused on six areas:
1) Evaluation
We have used the set of three different interfaces to carry out a user study to
enquire into the relationship between user satisfaction and aesthetic appeal of interface. The study has recently been submitted to HCI 2005. A paper on formal aspects of the algorithms involved and a quantitative evaluation of cluster performance has been submitted to D-Lib.
2) Consolidation
Consolidation work involved the development of scripts for software installation
and automated collection building; this helped us to bring to fruition our full programming and project integration within the consortium.
3) Integration
As part of our ongoing integration effort, we have ensured the successful installation
of Vishnu throughout the consortium. It has been installed at all sites and has dedicated servers at Tufts University, where it is integrated into the CHLT/Perseus Digital Library System; at the University of Missouri, Kansas City, which is also a CHLT anchor site; and at Imperial College London, where it runs in the Department of Computing and the Newton Project, Centre for the History of Science. The software and documentation are available online at http://faya.doc.ic.ac.uk:8800/chlt/vishnu.html.
(Note: Vishnu runs on Mozilla Firefox: http://www.mozilla.org/products/firefox/ )
4) Dissemination
The development and refinement of Vishnu has led to a number of publications and conference papers in year three: DMS 2004, JCDL 2004 and ECDL 2004, as well as the following:
S Rueger, "Putting the user in the loop: visual resource discovery", Proc 3rd
International Workshop on Adaptive Multimedia Retrieval (AMR, Jul 2005,
Glasgow), Springer LNCS, 2005
http://mmir.doc.ic.ac.uk/www-pub/amr-2005.pdf
D Heesch and S Rüger: Query-based keyword extraction and document clustering
for information retrieval and knowledge consolidation (in preparation for D-Lib)
S Rueger: Multimedia Information Systems, Fakulti Sains Komputer & Teknologi
Maklumat, Universiti Putra Malaysia, 18 Jul 2005
S Rueger, Multimedia Systems, Laboratoire GILCO, INP Grenoble, 7 Mar 2005
S Rueger, Multimedia Systems, Laboratoire ICAR, Institut de la Communication, Univ
Lumière Lyon 2, 10 May 2005
S Rueger: What's a falcon got to do with baseball? New Paradigms in Multimedia
Digital Libraries, Univ of the Arts, Central St Martin, London, 27 Jan 2005
S Rueger: New Paradigms in Multimedia Management and Search, BT, Adastral Park,
Ipswich, 22 Sept 2004
S. Rueger and D. Iorizzo, Visualizations for Cultural Heritage Language Technologies,
ECDL, Bath 2004.
S Rueger: New Paradigms in Multimedia Management and Search, Institution of
Electrical Engineers, Kingston-upon-Thames, 7 Sept 2004
S Rueger: New Paradigms in Multimedia Management and Search, University of
Bamberg, Ger, Computer Science Seminars, 10 Aug 2004
5) Open Source Technology
CHLT has an open source policy, in line with the Berlin Declaration; the technical resources used in WP1 are specified below:
MG:
MG is an open source full-text retrieval system that makes efficient use of disk resources by storing the index, and the text that is indexed, in compressed form. The MG system is
described in the eponymous book Managing Gigabytes. MG can be downloaded from: http://puka.cs.waikato.ac.nz/html/mg.html
Apache Lucene
Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. Apache Lucene is an open source project available for free download from Apache Jakarta: http://lucene.apache.org/java/docs/
Tomcat
Tomcat is a servlet container that is used in the official reference implementation for the
Java Servlet and JavaServer Pages technologies. The Java Servlet and JavaServer Pages
specifications are developed by Sun under the Java Community Process. Tomcat is developed in an open source and participatory environment and released under the Apache Software License: http://jakarta.apache.org/tomcat/
LiquidLnF
LiquidLnF is a Java Swing Look and Feel archive used for enhancing the style of the graphical user interfaces: https://sourceforge.net/projects/liquidlnf
6) Final Report on Cluster Visualisation Tool
Stefan Rueger, Daniel Heesch and Dolores Iorizzo, Imperial College London
Dept. of Computer Science & Centre for the History of Science
May 2005
This report presents a final summary of the work carried out
at Imperial College London during the period June 2004 to May 2005. The
principal objectives were the completion of work on a set of tools for the
clustering and the visualization of text documents and the application of these
tools to CHLT texts, including those of Newton, Old Norse, Ancient Greek and
Neo-Latin collections. We offer these tools as part of an integrated system
that performs full-text indexing, extraction and indexing of keywords,
full-text search, and clustering and visualization of returned document sets.
All
parts of the system are based on open-source software and are themselves
subject to the GNU General Public License. The system can be used either as a
stand-alone application, with collections being accessed locally or remotely, or
as a web application. The server side uses servlet technology, which has been
part of Java since version 1.3 and provides high-level abstractions and support
for a variety of services, such as dynamic HTML and socket communication. Servlets are
particularly powerful in conjunction with client-side applets. While we sought
to implement as much of the code as possible in platform-independent Java,
external software such as MG, which we employ as one of two alternative search
engines, is written almost entirely in C, as is, for efficiency reasons, the
server-side code for the clustering module.
We have developed a
standard Java API to provide a uniform way of accessing the functionality of
the indexing programs used for collection construction. Our present system
involves two sets of indices. The first set consists of indices required for
full-text search. At present we provide this indexing functionality through
either MG, a widely used open source system largely written in the C
programming language, or Lucene, a set of Java tools developed as part of the
Apache Jakarta Project. The second set of indices relates to the frequency
and other statistics of candidate keywords within all documents of a
collection. The full text and candidate keyword indexers are functionally quite
distinct, but core commonalities exist and are contained in an Indexer super
class.
package Indexer;

abstract public class Indexer
{
    public static final int DEFAULT = 0;
    public static final int DIP = 1;
    public static final int REG = 2;
    public static final int LEM = 3;
    public static final int SYN = 4;

    public static final String[] types =
        new String[]{"full", "dip", "reg", "lem", "syn"};

    int type;
    String name;
    String outputDir = null;
    String inputDir = null;
    String limiter = null;
    String collName = null;
    String VIS_HOME = null;

    public Indexer(String n)
    {
        name = n;
    }

    public void setRootDirectory(String root)
    {
        VIS_HOME = root;
    }

    public void setIndexingType(int t)
    {
        type = t;
    }

    public void setOutputDirectory(String s)
    {
        outputDir = s;
    }

    public void setInputDirectory(String s)
    {
        inputDir = s;
    }

    public void setLimiter(String s)
    {
        limiter = s;
    }

    public void setCollectionName(String c)
    {
        collName = c;
    }

    abstract public void startIndexing();
}
In the following we
briefly describe the two full-text search engines we currently deploy before presenting
rationale and methodology of our candidate keyword indexer.
We
currently deploy two alternative open source tools for full-text indexing and
search, one is MG, the other Lucene. In order to use other full text search
engines, a wrapper needs to be written that agrees with our Indexer API.
MG is an open-source indexing and retrieval system for text, images, and textual images, covered by the GNU General Public License. The current version of MG is available via FTP from http://www.cs.mu.oz.au/mg/. The MG (Managing Gigabytes) system is a collection of programs which together comprise a full-text retrieval system: one creates a database out of a given set of documents and then runs queries upon it to retrieve relevant documents. It is "full-text" in the sense that every word in the text is indexed, and a query operates only on this index to do the searching. A first introduction to MG can be found at http://www.mds.rmit.edu.au/mg/intro/mgintro.html, and a more detailed account is provided in the eponymous book "Managing Gigabytes".
For indexing, MG’s mgbuild script expects to run an application or a shell script that delivers a series of documents to it via standard output. A basic sample script is provided and it is up to the user to devise a suitable script if the basic functionality proves inappropriate. We have written a C program mg_get.c which fulfills this role. It reads in a file (docs.lst) which contains the paths of all the documents to be indexed, reads them one by one and passes the content to mgbuild.
We have written a Java wrapper for MG that uses the Java
process API to call the mgbuild script. Information about the docs.lst file and
the folder in which indices are to be stored are passed on as arguments to
mgbuild.
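The wrapper's use of the Java process API can be sketched as follows. This is an illustrative sketch, not the actual wrapper code: the real wrapper passes the docs.lst location and the index folder as arguments to the mgbuild script, whereas the demonstration below substitutes a trivial shell command (and so assumes a POSIX shell is available).

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

// Sketch of invoking an external indexing script (such as mgbuild) from
// Java via the process API. The command used in main() is a stand-in for
// the real mgbuild invocation.
public class MGBuildWrapper {
    public static int runScript(String... command) throws Exception {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.redirectErrorStream(true);          // merge stderr into stdout
        Process p = pb.start();
        BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream()));
        String line;
        while ((line = r.readLine()) != null) {
            System.out.println(line);          // forward the script's output
        }
        return p.waitFor();                    // exit status of the script
    }

    public static void main(String[] args) throws Exception {
        // In the real system: runScript("sh", "mgbuild", docsLst, indexDir);
        int status = runScript("sh", "-c", "echo indexing-done");
        System.out.println("exit=" + status);
    }
}
```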
Jakarta Lucene is a
high-performance, full-featured text search engine library written entirely in
Java. It is a technology suitable for nearly any application that requires
full-text search, especially cross-platform. It is part of the Apache Jakarta
Project, whose principal goal is to provide commercial-quality server
solutions, based on the Java Platform, developed in an open and cooperative
fashion.
Indexing is handled by the IndexWriter class, which is instantiated with two
arguments: the name of the directory for the indices and an instance of
org.apache.lucene.analysis.Analyzer. Analyzer is a super class that provides
the core functionality common to the various document-processing subclasses.
The StopAnalyzer, for example, tokenizes the input, converts all strings to
lowercase and filters out stop words from the index. At present Lucene only
provides a full suite of analyzers for English and German, but the extension
to other languages is under way. As for MG, we have written a Lucene wrapper
which harnesses the Lucene indexing functionality and which also inherits from
the abstract Indexer class.
The natural features of
text documents are words or phrases, and a document collection can contain
millions of such distinct features. The often-used word histogram
representation of documents consequently leads to very high dimensional
vectors. The intrinsic problem with such representations is that any two
randomly picked vectors in a high-dimensional hypercube tend to have a constant
distance from each other, no matter what the measure is. As a consequence,
clustering algorithms that are based on document distances become unreliable.
To give a numeric
example for the effect of high dimensionality, let x, y ∈ [0,1]^n be drawn
independently from a uniform distribution. The expected value of their sum-norm
distance |x-y|_1 is n/3, with a standard deviation of sqrt(n/18). For n = 1,800
(corresponding to a joint vocabulary of just 1,800 words for a word histogram
representation) this means a typical distance of |x-y|_1 = 600 ± 10. With
increasing n, the ratio between standard deviation and vector size gets ever
smaller, as it scales with n^(-1/2). This is a generic statistical
property of high-dimensional spaces with any standard distance measure, and can
be traced down to the law of large numbers. Although word histogram document
representations are by no means random vectors, each additional dimension tends
not only to spread the size of a cluster but also to dilute the distance of two
previously well-separated clusters. Hence, it seems prohibitive to involve all
semantic features (e.g., the words) of a document collection for document
clustering. Even after applying feature reduction, the number of features
remains large. In early experiments with 548,948 documents (TREC CDs vol3 and
vol4, including articles of the Los Angeles Times, the Financial Times, the
Federal Register, Congress Records and the Foreign Broadcast Information
Service, see http://trec.nist.gov), a candidate keyword
had to appear in at least three documents and in no more than 33% of all
documents. This resulted in a vocabulary of 222,872 candidate keywords.
In our system we store, at index time, around 200 candidate keywords per
document. A set R of documents returned by a query may still have a
candidate-keyword vocabulary of well over 10,000 different words. As an
example, the four sets of top 500 documents returned by the queries "mafia",
"Clinton", "cancer" and "computer" use a vocabulary of between 14,000 and
17,000 different candidate keywords.
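The distance-concentration figures quoted above are easy to verify empirically. The following self-contained check draws random pairs from [0,1]^n and compares the observed mean and standard deviation of the sum-norm distance with the theoretical values n/3 and sqrt(n/18); it prints "ok" when both agree within a small tolerance. The trial count and seed are arbitrary choices.

```java
import java.util.Random;

// Monte Carlo check of distance concentration: for x, y drawn uniformly
// from [0,1]^n, the sum-norm distance clusters around n/3 with standard
// deviation sqrt(n/18), so for n = 1800 almost all pairs lie near 600 +- 10.
public class DistanceConcentration {
    public static void main(String[] args) {
        int n = 1800, trials = 2000;
        Random rnd = new Random(42);           // fixed seed for reproducibility
        double sum = 0, sumSq = 0;
        for (int t = 0; t < trials; t++) {
            double d = 0;
            for (int i = 0; i < n; i++) {
                d += Math.abs(rnd.nextDouble() - rnd.nextDouble());
            }
            sum += d;
            sumSq += d * d;
        }
        double mean = sum / trials;
        double sd = Math.sqrt(sumSq / trials - mean * mean);
        System.out.printf("mean = %.1f (theory %.1f)%n", mean, n / 3.0);
        System.out.printf("sd   = %.1f (theory %.1f)%n", sd, Math.sqrt(n / 18.0));
        System.out.println(Math.abs(mean - n / 3.0) < 2
                && Math.abs(sd - Math.sqrt(n / 18.0)) < 1 ? "ok" : "off");
    }
}
```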
The CK wrapper provides
a Java environment from within which the C programs responsible for the
extraction and indexing of candidate keywords can be started. The location of
the source documents and the folder for the indices can be set through public
methods defined in the Indexer super class. This information is passed on as
arguments to a shell script, ck_index.sh, which in turn calls the backend C
program, the functional core of the module.
To aid the construction
of new collections and to demonstrate the use of the API of the afore-mentioned
Java wrappers, we have written a small application that provides a graphical
user interface with all the basic functionality hitherto only available through
the command line. As can be seen in the screenshot of the GUI in Figure 1, the
essential information is the location of the documents to be indexed and the
path to the places where the MG and CK indices are to be kept. Other
information, such as the name of the collection, is not critical but is useful
for subsequently informing the user about the content of particular
collections.
Figure 1: GUI for collection building
Full text search is taken over by either Lucene or MG,
depending on which software was used to create the indices. The system can
determine the indexer to be used by reading in the appropriate field in the
collection’s config.xml file. For both MG and Lucene, we have written Search
wrappers that implement a basic searcher interface as shown below.
package server.Search;

import java.util.*;

public interface SearchInterface
{
    public void search(String query, String collection, int view);
    public Vector getDocIdentifier();
    public Vector getDocDescriptions();
    public Vector getDocContent();
}
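To illustrate how a new search backend would plug into this contract, here is a minimal in-memory implementation. The interface is reproduced (without its package) so the example is self-contained; MemorySearcher, its toy corpus and its naive substring matching are purely illustrative and are not part of the real MG or Lucene wrappers.

```java
import java.util.*;

// A minimal, illustrative implementation of the searcher contract.
interface SearchInterface {
    public void search(String query, String collection, int view);
    public Vector getDocIdentifier();
    public Vector getDocDescriptions();
    public Vector getDocContent();
}

public class MemorySearcher implements SearchInterface {
    private String[] docs = {
        "arma virumque cano",
        "gallia est omnis divisa in partes tres",
        "cano ergo sum"
    };
    private Vector ids = new Vector();
    private Vector descriptions = new Vector();
    private Vector contents = new Vector();

    public void search(String query, String collection, int view) {
        ids.clear(); descriptions.clear(); contents.clear();
        for (int i = 0; i < docs.length; i++) {
            if (docs[i].contains(query)) {        // naive full-text match
                ids.add(Integer.valueOf(i));      // document number
                descriptions.add(                 // short extract ("snippet")
                    docs[i].substring(0, Math.min(10, docs[i].length())) + "...");
                contents.add(docs[i]);            // full document content
            }
        }
    }

    public Vector getDocIdentifier()   { return ids; }
    public Vector getDocDescriptions() { return descriptions; }
    public Vector getDocContent()      { return contents; }

    public static void main(String[] args) {
        MemorySearcher s = new MemorySearcher();
        s.search("cano", "demo", 0);
        System.out.println("hits=" + s.getDocIdentifier().size());
        System.out.println(s.getDocIdentifier());
    }
}
```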
MG’s search engine can
be run in a number of modes. We currently ask MG to perform ranked query
searches and to return for each relevant document its index and a brief section
of the text. The query program already has the functionality to extract short
samples of text. We modified this method so that it would skip as many
non-alphabetical characters as were found at the beginning of the document. An
extra command directive was further added to the query program to be able to
retrieve the full content of relevant documents.
In Lucene, full document search is achieved through the collaboration of an
IndexSearcher, an Analyzer and a QueryParser. The QueryParser is constructed
with an analyzer that interprets the query in the same way the index was
previously interpreted. The QueryParser produces a Query object, which is
passed to the searcher, and the search results are returned as a collection of
Documents. The Lucene wrapper is again given in the technical appendix. Note
that it implements the Searcher interface, which specifies the core methods
required by our application (e.g. getDocContent(), search() etc.).
We have adopted what is
at the same time a more systematic and a more dynamic way of associating the
visualization with collections. According to the new model, which has in a
similar way been adopted by GSDL (Greenstone Digital Library), collections are
grouped to form sites. At each level of the hierarchy, we can specify
properties and services that hold for all the collections contained in the
corresponding sub-tree. At start-up, the system recursively scans the sites
directory, initializes the various collections found and presents them to the user.
At the lowest level of
the hierarchy we allow each collection to consist of a number of “views'', each
with its own set of indices. The default view is the set of unprocessed
documents; derived views, or surrogate documents, are the result of such
operations as stemming, stopping, or lemmatization of individual words. Thus,
generating views amounts to the formation of equivalence classes based on
morphological, semantic or functional similarity. Applying a stop list to a
collection, for example, can be construed as a mapping of all commonly
occurring words to an empty class, and a mapping of all other words to
themselves.
This structural
extension has implications for how collection building and search are carried
out. In the process of collection building, documents are first imported from a
specified directory, with documents for each view placed in subdirectories on
their own. Each such subdirectory is assumed to contain a file with a list of
all documents to be indexed for that particular view. The indexer creates a set
of parallel indices in the collection's index directory with each subdirectory
being named by the first three letters of the operation that characterizes the view
(def=default, ste=stemming, lem=lemmatization, syn=synonyms etc.). Document
processing is not currently part of the integrated indexing software, i.e. the
indexer assumes the documents to be pre-processed, and collection building thus
consists, at the highest level, of the two steps of document processing
and document indexing.
At present the user
specifies the desired view as part of the query string, in a similar fashion to
advanced query formulation in the input field of conventional text search
engines like Google. For example, "\l" indicates that the query is to be
lemmatized and the search carried out in the lemmatized documents.
Because collections are likely to differ in the number and types of views they
offer, it might well be justified to use an additional graphical component to
select the view. When no view is specified, the system assumes the default view
(i.e. the query remains unprocessed and indices are retrieved from the
index/def directory). While both full text indices and candidate keyword
indices are view-specific, the actual document to be retrieved is still the
original, unprocessed document.
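The view-selection step described above amounts to a small amount of query pre-processing. The sketch below splits a query string into a view selector and the remaining search terms; the "\l" escape for the lemmatized view is taken from the text, while the class name, the "def"/"lem" directory labels and the handling of other flags are illustrative assumptions.

```java
// Sketch of splitting a query string into a view selector and search terms.
// Only the "\l" flag from the text is handled; everything else defaults to
// the unprocessed view (index/def). Names are illustrative.
public class QueryViewParser {
    public static String[] parse(String query) {
        String view = "def";                   // default view: index/def
        String terms = query.trim();
        if (terms.startsWith("\\l")) {         // "\l" selects the lemmatized view
            view = "lem";
            terms = terms.substring(2).trim();
        }
        return new String[]{view, terms};
    }

    public static void main(String[] args) {
        String[] a = parse("\\l arma virumque");
        String[] b = parse("arma virumque");
        System.out.println(a[0] + "|" + a[1]);
        System.out.println(b[0] + "|" + b[1]);
    }
}
```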
Our
clustering is based on the joint document/keyword distribution. While the
documents are available from the full-text search engine, the keywords still
need to be determined. Because we have already narrowed down the set of
potential keywords to our set of candidate keywords at index time, our task is
relatively straightforward and will be detailed below.
Processing the results is done in real time and has to be fairly quick, so we have taken some care to use algorithms that are efficient and as close to O(n log n) as possible. We use trees as data structures, since trees allow the fast insertion and location of items in sorted data sets. Trees lend themselves to recursive algorithms as their structure is itself recursive; in using recursion with the trees we have removed instances of tail recursion. When the query program returns a result, it is in the form of a document number and a short piece of the document text (a 'snippet', see Appendix A) for each document in the search result set. The result-processing program takes the document number of each document and loads the matching ordered vector of the indices and populations of candidate keywords. It reads these from the variable-length record file compiled in section 4.2. The snippet is stored in an array for later use. A tree is built containing the index of each word in the result set, the search-set document frequency and the archive-set document frequency. This tree is ordered by archive-set document frequency, and as the vectors are also ordered the same way, the iteration filling the tree with the contents of each vector is shuffled so as to generate a shallower tree than would otherwise be the case: the median values are entered into the tree before the extreme values. The deeper the tree, the more the time taken to build it approaches O(nm), where n is the number of different candidate keywords in the search set and m is the (large) number of references to possibly interesting words in the document vectors. An optimal tree should take O(m log n) time to build, log n being the depth of the tree. Once the tree is built, the search-set document frequency for each candidate keyword is known.
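The effect of the shuffled, median-first insertion order can be demonstrated on a toy unbalanced binary search tree: inserting sorted keys in order produces a degenerate tree of depth n, whereas entering the median of each sub-range before the extremes keeps the depth near log2(n). The code below is a self-contained illustration of that principle, not the actual result-processing tree.

```java
// Demonstration of median-first insertion into an unbalanced binary search
// tree: sorted-order insertion degenerates to a linked list of depth n,
// while median-first insertion yields a tree of depth close to log2(n).
public class MedianFirstTree {
    static class Node { int key; Node left, right; Node(int k) { key = k; } }

    static Node insert(Node root, int key) {
        if (root == null) return new Node(key);
        if (key < root.key) root.left = insert(root.left, key);
        else root.right = insert(root.right, key);
        return root;
    }

    static int depth(Node n) {
        return n == null ? 0 : 1 + Math.max(depth(n.left), depth(n.right));
    }

    // Recursively insert the median of each sub-range before its extremes.
    static Node medianFirst(Node root, int lo, int hi) {
        if (lo > hi) return root;
        int mid = (lo + hi) / 2;
        root = insert(root, mid);
        root = medianFirst(root, lo, mid - 1);
        return medianFirst(root, mid + 1, hi);
    }

    public static void main(String[] args) {
        int n = 1023;
        Node sorted = null;
        for (int i = 0; i < n; i++) sorted = insert(sorted, i);
        Node balanced = medianFirst(null, 0, n - 1);
        System.out.println("sorted-order depth: " + depth(sorted));
        System.out.println("median-first depth: " + depth(balanced));
    }
}
```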
The tree is then iterated to calculate the value of the 'interesting weight' (see section 2.1) for each candidate keyword in the tree, using the method explained in section 2.1. A new tree ordered by this 'interesting weight' is built by pulling leaves from the old tree and inserting them into the new tree, thus destroying the old tree. The top 100 elements of this new tree are used to build an array of keyword indices; the rest of the candidate keywords are discarded. This array is used to build a translation table that maps the array position of each keyword to the order of the index value of the words. The table is filled in order of decreasing keyword weight and then sorted, using quicksort, into order of increasing keyword index value.
The next task is to build a matrix of the incidence of keywords in documents. The documents represent the rows and the keywords the columns. As the vast majority of the entries in the matrix will be zero, we opt for a sparse matrix design. In this incarnation a sparse matrix consists of an array of linked lists representing the rows of the matrix. The linked lists are lists of objects, each of which records the column of the matrix, the value of the matrix node and the address of the next object (or null). There need be no order to the objects in a row, but there must not be any duplicate references in a row to the same column. To construct this sparse matrix we iterate through each of the vectors containing the indices of candidate keywords for the documents in the search result. As the indices in both the translation table and the vectors are sorted in order of increasing keyword index value, the two lists of indices can be efficiently compared without having to check each element in one against each element in the other: for example, if the current index in the translation table is higher than the current index in the vector, we can skip to the next index in the vector. If the indices match, an entry is added to the sparse matrix row for that document. A sparse matrix is thus quickly built up that contains the keyword-versus-document information for the whole search result set.
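The linear-pass comparison of the two sorted index lists can be sketched as a standard sorted-list intersection. This is an illustrative reduction of the row-construction step: it returns, for one document, the positions in the translation table whose keyword index also occurs in the document's vector, rather than building the full linked-list matrix.

```java
import java.util.*;

// Sketch of sparse-row construction: the translation table and each
// document's candidate-keyword vector are both sorted by increasing keyword
// index, so they can be intersected in a single linear pass.
public class SparseRowBuilder {
    // Returns the translation-table positions whose keyword index also
    // occurs in the document's sorted index vector.
    public static List<Integer> row(int[] table, int[] docIndices) {
        List<Integer> entries = new ArrayList<Integer>();
        int i = 0, j = 0;
        while (i < table.length && j < docIndices.length) {
            if (table[i] < docIndices[j]) i++;        // keyword absent from doc
            else if (table[i] > docIndices[j]) j++;   // word is not a kept keyword
            else { entries.add(i); i++; j++; }        // match: add a matrix entry
        }
        return entries;
    }

    public static void main(String[] args) {
        int[] table = {4, 9, 17, 23, 42};             // kept keyword indices
        int[] doc = {2, 9, 11, 23, 57};               // one document's vector
        System.out.println(row(table, doc));
    }
}
```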
At this point those documents that contain none of the key words are eliminated from the matrix, as none of the visualizations can say anything about them. In the vast majority of cases, documents that are eliminated are those that have been ranked by the search engine below the 400th place in the result set.
Document clustering has
attracted much interest in recent decades, and much is known about the
importance of feature reduction in general, and for clustering in particular.
However, little has been done so far to facilitate feature reduction for
document clustering of query results. In contrast to the conventional approach
of using a tf-idf weighting scheme, we suggest ranking the importance of each
such candidate keyword j with a weight

w_j = (r_j / d_j) · (r_j / |R|) (1 - r_j / |R|)
where |R| is the total
number of documents in the set R of returned documents, r_j is the
number of documents in R containing word j, and d_j is the number of
documents in the whole document collection D containing j. The second factor
prefers words with medium-valued matched-document frequencies, while the first
factor prefers words that specifically occur in the matched documents. The
highest-ranked words are meant to be related to the query. Indeed, we have
"software", "IBM", "UNIX" etc. as the top-ranked words when querying for
"computer". This seems to be a powerful approach to restrict the features of
the matched documents to the top k ranked words, which we will call the
keywords. One important aspect is that the features are computed at query time.
Hence, when the above query is refined to "computer hardware", a completely
new set of keywords emerges automatically.
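The ranking step above is cheap to compute per candidate keyword. In the sketch below the exact weight is modelled as the two-factor product w_j = (r_j/d_j) · (r_j/|R|)(1 - r_j/|R|), in which the first factor favours words specific to the matched set and the second peaks at medium matched-document frequencies; treat this concrete formula, and all names and numbers, as assumptions for illustration.

```java
// Sketch of the keyword-ranking step. The weight formula here is an assumed
// two-factor model: (rj/dj) favours words specific to the matched set R,
// and (rj/|R|)(1 - rj/|R|) peaks at medium matched-document frequencies.
public class KeywordWeight {
    public static double weight(int rj, int dj, int sizeR) {
        double f = (double) rj / sizeR;        // matched-document frequency
        return ((double) rj / dj) * f * (1.0 - f);
    }

    public static void main(String[] args) {
        int sizeR = 500;                       // |R|: number of matched documents
        // A word concentrated in R outranks one spread over the collection:
        System.out.println(weight(250, 300, sizeR) > weight(250, 30000, sizeR));
        // Medium matched-document frequency outranks near-ubiquity within R:
        System.out.println(weight(250, 1000, sizeR) > weight(499, 1000, sizeR));
    }
}
```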
For each matched
document i we create a k-dimensional vector v_i, whose j-th component
v_ij is a function of the number of occurrences t_ij of the
j-th ranked keyword in the document i:

v_ij = log(1 + t_ij) · log(|D| / d_j)
This is a variation of
the tf-idf weight that gives less stress to the term frequency t_ij
(the number of occurrences of the j-th ranked keyword in document i). We project
the vector v_i onto the k-dimensional unit sphere, obtaining a
normalized vector u_i that represents the document i. We deem the
scalar product of u_a and u_b (i.e. the cosine of the angle
between vectors v_a and v_b) a sensible semantic similarity
measure between two documents a and b in the document subset R returned by a
query with respect to the complete document collection D.
The matrix u may be viewed as a
document representation matrix where the row vector u_i is a
k-dimensional representation of document i and u_ij is viewed as the
importance of keyword j for document i. In particular, u_ij = 0 if and
only if document i does not contain keyword j. The number of features k can be
controlled by the experimenter, and our experiments using the TREC data of
human relevance judgments have shown that k = 10 yields superior clustering
results. Note that even if only the top ten keywords are used for the
clustering and document representation, we might still display more keywords on
the screen to assist the user in his or her search.
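The projection onto the unit sphere and the resulting similarity measure can be sketched directly: after normalization, the scalar product of two document vectors is the cosine of the angle between them. The vectors in main() are arbitrary toy data.

```java
// Sketch of the document representation: each k-dimensional keyword vector
// is projected onto the unit sphere, after which the scalar product of two
// normalized vectors is the cosine of the angle between them.
public class DocSimilarity {
    public static double[] normalize(double[] v) {
        double norm = 0;
        for (double x : v) norm += x * x;
        norm = Math.sqrt(norm);
        double[] u = new double[v.length];
        for (int i = 0; i < v.length; i++) u[i] = v[i] / norm;
        return u;
    }

    public static double cosine(double[] a, double[] b) {
        double[] ua = normalize(a), ub = normalize(b);
        double dot = 0;
        for (int i = 0; i < ua.length; i++) dot += ua[i] * ub[i];
        return dot;
    }

    public static void main(String[] args) {
        double[] d1 = {3, 0, 1};             // k = 3 keyword weights
        double[] d2 = {6, 0, 2};             // same direction, twice the length
        double[] d3 = {0, 5, 0};             // no keywords in common with d1
        System.out.printf("%.3f%n", cosine(d1, d2));
        System.out.printf("%.3f%n", cosine(d1, d3));
    }
}
```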
Post-retrieval document
clustering has been well studied. We deploy a variant of the Buckshot
algorithm: each cluster contains a certain number of document vectors and is
represented by their normalized arithmetic mean, the so-called centroid vector.
In the first phase, hierarchical clustering with complete linkage operates on
the best-ranked 150 documents. This can be done in a fraction of one second CPU
time. Hierarchical clustering has the advantage that one can either prescribe
the number N of clusters or let the number of clusters be determined by demanding
a certain minimum similarity within a cluster. Either way, once clusters within
the top-ranked documents are identified, their N centroids can be computed and
used as a seed for standard iterative clustering of the remaining documents.
This algorithm scales linearly with the number |D| of documents and the number N
of clusters. In our experience, one cycle of iterative clustering is not only
adequate but also preserves the cluster structure given by the top-ranked
documents. 1,000 documents can thus be clustered in well under one second on a
2 GHz PC.
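The second, linear-time phase of this procedure can be sketched as a single assignment pass: given the N centroids obtained from hierarchical clustering of the top-ranked documents, each remaining document vector is assigned to its most similar centroid. Vectors are assumed already normalized, so the dot product plays the role of the cosine similarity; the data and names below are illustrative.

```java
import java.util.*;

// Sketch of the assignment phase of a Buckshot-style procedure: every
// document vector is labelled with the index of its most similar seed
// centroid in one linear pass.
public class CentroidAssign {
    static double dot(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    public static int[] assign(double[][] docs, double[][] centroids) {
        int[] label = new int[docs.length];
        for (int d = 0; d < docs.length; d++) {
            int best = 0;
            double bestSim = dot(docs[d], centroids[0]);
            for (int c = 1; c < centroids.length; c++) {
                double sim = dot(docs[d], centroids[c]);
                if (sim > bestSim) { bestSim = sim; best = c; }
            }
            label[d] = best;                   // most similar centroid wins
        }
        return label;
    }

    public static void main(String[] args) {
        double[][] centroids = {{1, 0}, {0, 1}};          // N = 2 seed clusters
        double[][] docs = {{0.9, 0.1}, {0.2, 0.8}, {0.7, 0.3}};
        System.out.println(Arrays.toString(assign(docs, centroids)));
    }
}
```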
For reasons of
efficiency, some of the critical server-side code has been written using the C
programming language. These are the programs responsible for keyword
computation and document clustering. To access this functionality, we use the
Java Native Interface (JNI), which is the native programming interface for Java
and part of the JDK. By writing programs using the JNI, one can ensure that the
code is completely portable across all platforms. The JNI allows Java code that
runs within a Java Virtual Machine (VM) to operate with applications and
libraries written in other languages, such as C, C++, and assembly. Programmers
use the JNI when an application cannot be written entirely in the Java
programming language: for example, when the standard Java class library does
not support the platform-dependent features needed by the application, when one
already has a library or application written in another programming language,
or when some small portion of time-critical code is to be written in a
lower-level programming language, such as assembly, and called from the Java
application. The JNI framework lets native methods utilize Java objects in
the same way that Java code uses these objects. A native method can create Java
objects, including arrays and strings, and then inspect and use these objects
to perform its tasks. A native method can also inspect and use objects created
by Java application code. A native method can even update Java objects that it
created or that were passed to it, and these updated objects are available to
the Java application. Thus, both the native language side and the Java side of
an application can create, update, and access Java objects and then share these
objects between them.
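The Java side of such a binding might look as follows. This is an illustrative sketch, not the actual CHLT sources: the class name and method signature are invented, and the native body would live in a C file compiled into a shared library such as libCluster.so.

```java
// Sketch of the Java half of a JNI binding for the C clustering code.
public class ClusterBridge {

    // Declared native: the implementation is provided by the shared library.
    // Compiling this declaration needs no native code; it only matters at call time.
    public native int[] clusterDocuments(int[] docIds, int numClusters);

    // In real use the library is loaded once per JVM, e.g.:
    // static { System.loadLibrary("Cluster"); }  // resolves to libCluster.so on Linux

    public static void main(String[] args) {
        // mapLibraryName shows how a logical library name maps to the
        // platform-specific file name the loader will look for.
        System.out.println(System.mapLibraryName("Cluster"));
    }
}
```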
We
have developed three visualizations of the set of returned documents. Two of
these, the Sammon Map and the Dendro Map, make direct use of the clustering
result. The application as a whole offers the possibility of browsing the same
result set in several different ways simultaneously. The cluster-based
visualizations give a broader overall picture of the result, while the Radial
visualization allows the user to focus on subsets of keywords. Also, as the
clusters are approximations that bring to the fore particular keywords, it may
be useful to return to the Radial visualization and examine the effect of these
keywords upon the whole document set. The Radial visualization will perhaps be
more fruitful if the initial keywords match the user's area of interest. The
Sammon Map will let the user dissect search sets and re-cluster subsets,
gradually homing in on target sets. Again the user may have recourse to the
Radial visualization and apply those keywords to the whole result set. In the
end, he or she will need to read only a small fraction of the retrieved
documents. The cluster-based visualizations give a visual analogy of the
structure implicit in the library classification scheme. The Radial
visualization reveals the effects of keywords that create cross-references
between documents, and allows the user to skim between subject areas.
The simplest of our visualizations is a conventional ranked
list display of the results set. This is the style adopted by most web search
engines such as Google or AltaVista. For each hit, we display the document
number, which itself is a link to the original document, and a small extract of
the document.
Figure 2: Conventional Ranked List display of results
sets.
This visualization uses a
non-linear Sammon transformation to map the locations of document cluster
centroids to points on a two-dimensional plane. The Sammon map is computed
using an iterative gradient search as detailed in Sammon’s original paper. The algorithm aims to represent
n-dimensional points in a lower-dimensional space while attempting to preserve
the pair-wise distances between the objects. The arrangement of the clusters on
the display is such that their mutual distances are indicative of the relationships
found in the original higher-dimensional space.
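The iteration can be sketched as below. This is a simplified gradient scheme in the spirit of Sammon's method, not the Vishnu implementation or Sammon's exact second-order update; the stress function, learning rate and names are illustrative.

```java
// Sketch of Sammon-style mapping: place high-dimensional centroids on the
// plane so that pairwise distances are approximately preserved, by iterative
// gradient descent on a stress function.
public class SammonSketch {

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
        return Math.sqrt(s);
    }

    // Sammon stress: sum over pairs of (d*_ij - d_ij)^2 / d*_ij, normalized
    // by the sum of the original distances d*_ij.
    static double stress(double[][] high, double[][] low) {
        double c = 0, e = 0;
        for (int i = 0; i < high.length; i++)
            for (int j = i + 1; j < high.length; j++) {
                double dh = dist(high[i], high[j]);
                double dl = dist(low[i], low[j]);
                c += dh;
                e += (dh - dl) * (dh - dl) / dh;
            }
        return e / c;
    }

    // One crude gradient step: nudge each 2-D point towards matching the
    // high-dimensional pairwise distances (learning rate lr).
    static void step(double[][] high, double[][] low, double lr) {
        for (int i = 0; i < low.length; i++)
            for (int j = 0; j < low.length; j++) {
                if (i == j) continue;
                double dh = dist(high[i], high[j]);
                double dl = dist(low[i], low[j]);
                if (dl == 0 || dh == 0) continue;
                // Positive f pushes the points apart (too close), negative pulls them in.
                double f = lr * (dh - dl) / (dh * dl);
                for (int k = 0; k < 2; k++)
                    low[i][k] += f * (low[i][k] - low[j][k]);
            }
    }

    public static void main(String[] args) {
        double[][] high = { {0, 0, 0}, {1, 0, 0}, {0, 2, 0}, {0, 0, 3} };
        double[][] low = { {0.1, 0.2}, {0.8, 0.3}, {0.4, 0.9}, {0.6, 0.6} };
        double before = stress(high, low);
        for (int it = 0; it < 200; it++) step(high, low, 0.1);
        System.out.printf("stress: %.4f -> %.4f%n", before, stress(high, low));
    }
}
```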
In the central panel each cluster is represented by a circle and is labelled with its most frequent keyword. The radius of the circle indicates the size of the cluster. The distance between any two circles in the graphic panel is an indication of the similarity of their respective clusters: the nearer the clusters on the display, the more similar the documents contained within are likely to be.
The Sammon visualization allows the user to select clusters and thus to home in on smaller sets of documents. When the mouse passes over a cluster circle, a "tool tip" box in the form of a pop-up menu appears. Operations such as "select" and "drill down" can be performed for any particular cluster or set of selected clusters.
Figure 3: Basic Sammon Map Visualization.
The first item in the cluster pop-up menu shows the number of documents contained in that cluster. Choosing this item displays a scrolled table of cluster keywords in the pane on the left-hand side of the visualization, and a scrolled list of cluster document hot-links and snippets appears in the scrolling text window at the bottom of the display.
The drill down item in the
pop-up menu performs a redisplay of the documents in the set of selected
clusters. Clustering is performed online on this new subset of documents and a
new Sammon map is created. Unlike in the Dendro Map, where the undrill
functionality is associated with the current root node and is offered to the
user when she moves over the central node, no such root node exists in the
Sammon map, and the operational symmetry between the two visualizations
breaks down at this point: upon drill-down the level indication at the top of
the display is incremented and the back button enabled. The select item in the
pop-up menu marks a cluster as selected and signals this with a flag. The other
menu items serve to label the cluster with four significant keywords and are
not selectable. An example of a drill-down operation is presented in Figure 4,
where the documents to be clustered are the union of the three clusters
selected in Figure 3 (recital, lisztian, and recital).
On the right of the main
visualization, we display a small extract of the document similar to that on
the ranked list display. Clicking on the document identifier opens up a new
frame displaying the full content of the document.
Figure 4: Drill Down on the union of three selected
clusters.
At the new level, we may choose a new cluster. The new table of keywords on
the left is thus associated with the cluster labelled "polonaise". The four
keywords displayed in the pop-up menu give the user a rough idea of what to
expect.
The table to the left shows the list of most prominent keywords and the
frequency of their occurrence. Each entry includes a check box that can be
selected. At the bottom of the table is a filter button that makes the
scrolling text window display only the hot-links and snippets from documents
that contain the selected keywords. This filtering functionality is placed in
its own class and can thus readily be used in conjunction with other
visualizations.
The Radial Map is similar to VIBE, Radviz and Lyberworld. Radviz places the keyword nodes in a circle as dimensional anchors and the document nodes in the middle, as if suspended by springs connecting them to the keyword nodes. The greater the keyword content of a document, the stronger the effect of the corresponding springs on its location. We thus make direct use of the representation matrix u without explicit clustering. Radial adds a level of user interaction to the metaphor introduced by RadViz and VIBE. Building on VIBE, Lyberworld takes a similar idea into three dimensions. In Lyberworld, vector addition is used to position the document nodes within a relevance sphere. The document's keyword content creates a vector between the centre axis of the sphere and the position of that keyword on the sphere's surface. The radius of the sphere is defined by the range of possible vector lengths. In Lyberworld the relevance sphere can be rotated so that the 2D computer display can give more of a clue as to the 3D location of a document node relative to the keyword nodes. Also, the relative attractiveness of the keyword nodes can be enhanced to pull related documents towards them. Radial, staying with only two dimensions, dispenses with some of the perceptual complexity implicit in rendering a three-dimensional model on a two-dimensional screen.
Radial
also uses the statistically collected keywords to differentiate the documents.
Initially, the twelve most highly ranked keywords are displayed in a circle.
Any documents in the search set that contain those keywords are placed using a
similar vector sum within the circle. As the mouse passes over the document
nodes, a bubble displays a descriptive piece of text from the document. The
dimensions of the circle are more arbitrary than in Lyberworld and if the
display were simply based on a flat sum of vectors it would be possible for the
document nodes to be outside the circle. However we have constrained their
positions by projecting the radial locations through the arc-tangent function,
with the result that the documents cannot be moved outside of the circle.
Figure 5: Radial Map Visualization
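The placement rule just described can be sketched as follows. The names are illustrative, not the Vishnu sources: keyword anchors sit evenly spaced on a unit circle, a document's position is the weighted vector sum of the anchors for the keywords it contains, and the resulting radius is squashed through the arc-tangent so the node can never leave the circle.

```java
// Sketch of Radial-style placement with arc-tangent radius squashing.
public class RadialSketch {

    // Anchor position of keyword k out of n, evenly spaced on the unit circle.
    static double[] anchor(int k, int n) {
        double a = 2 * Math.PI * k / n;
        return new double[] { Math.cos(a), Math.sin(a) };
    }

    // Place a document from its keyword weights (one weight per anchor).
    static double[] place(double[] weights) {
        int n = weights.length;
        double x = 0, y = 0;
        for (int k = 0; k < n; k++) {
            double[] p = anchor(k, n);
            x += weights[k] * p[0];
            y += weights[k] * p[1];
        }
        // atan maps [0, inf) onto [0, pi/2), so after normalization the node
        // always stays strictly inside the display circle of radius 1.
        double r = Math.hypot(x, y);
        if (r == 0) return new double[] { 0, 0 };
        double squashed = Math.atan(r) / (Math.PI / 2);
        return new double[] { x / r * squashed, y / r * squashed };
    }

    public static void main(String[] args) {
        // A document dominated by keyword 0 sits near that anchor, inside the circle.
        double[] p = place(new double[] { 5.0, 0.5, 0.5, 0.5 });
        System.out.printf("(%.3f, %.3f)%n", p[0], p[1]);
    }
}
```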
Dimensionality reduction has the
effect that locations of document nodes become ambiguous. There may be many
reasons for a node to have a particular position. To mitigate this ambiguity,
in Radial the user can click on a document node, and the keyword nodes that
affect the location of the document are highlighted. We believe that this is a novel and
useful feature. Clues to the effects of different dimensions are also given
when a keyword node is selected with the mouse: the document nodes that contain
the keyword are highlighted. As in Lyberworld, the vector effect of a
particular keyword node on the document set can be enhanced. In Radial,
however, this is achieved by grabbing the keyword node with the mouse and
moving it away from the edge of the circle: all documents that contain this
keyword follow the movement of the keyword accordingly. Manual clustering can be effected by
placing several keyword nodes together. Alternatively, a button allows the
displayed keyword arrangement to be automatically clustered using the columns
of the matrix u.
The choice of keywords used in
the display can be extended or reduced by clicking on two visible lists of
words. Zoom buttons allow the degree of projection to be increased or reduced
so as to distinguish between document nodes around the edges of the display or
at the centre.
The Radial visualization is meant
to be an interactive tool for structuring the document set according to one's
own preferences by shifting keywords around in the display. The shortcoming of
the Radial visualization is that it can only say something about the documents
in the result set that contain the particular set of keywords, and increasing
the number of keywords eventually makes the whole display very difficult to
interpret. This is where the cluster-based visualizations prove their merits.
The
Dendro Map aims to achieve a synthesis between the binary tree clustering and
space-filling, focus-and-context display techniques such as tree maps,
hyperbolic browsers or fractal layouts. The rationale behind this hybrid
approach lies in combining the virtues of both approaches (uniformity,
effective space-utilisation, respectively) while trying to remain
computationally inexpensive so as to be able to deal with large text corpora.
The Dendro Map provides a uniform
view of hierarchically organized information. The nodes are binary and the tree
radiates out from the root node, which is positioned at the center of the
display. The distances between successive nodes shrink with their distance from
the root and the Dendro Map thus mimics the effect of mapping a hyperbolic
distortion of the tree onto the Euclidean disc. It differs from the hyperbolic
approach in two fundamental aspects. First, in the Dendro Map, the tree is
clipped at the fringe of the circular display after a fixed number of levels. By
contrast, a hyperbolic visualization relies on an invertible, and thus
information-preserving mapping of the entire tree hierarchy. In practice,
however, the same limitation regarding the number of levels also applies to the
hyperbolic display as distances towards the fringe become too small to
be resolved. Secondly, unlike the hyperbolic visualization, the Dendro Map
displays context information only in one direction as parent nodes disappear
from sight as lower level nodes are placed into focus. While this may be
considered a restriction for undirected browsing tasks, it facilitates directed
search by maximizing the display space for the children of the current
top-level node.
Figure 6: Plane-filling binary tree in the Dendro Map
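The shrinking-distance layout can be sketched with a simple radius schedule. The geometric constants below are hypothetical, chosen only to illustrate the idea: successive tree levels are placed at radii whose spacing shrinks geometrically with depth, so the layout approaches but never reaches the fringe of the unit display disc, mimicking a hyperbolic mapping.

```java
// Sketch of a geometric radius schedule for a Dendro-Map-like layout.
public class DendroRadii {

    // Radius of a node at the given depth; ratio < 1 controls the shrinkage.
    // The geometric series 1 - ratio^depth approaches 1 as depth grows,
    // so nodes never reach the edge of the unit disc.
    static double radius(int depth, double ratio) {
        return 1.0 - Math.pow(ratio, depth);
    }

    public static void main(String[] args) {
        for (int d = 0; d <= 5; d++)
            System.out.printf("depth %d -> r = %.3f%n", d, radius(d, 0.5));
    }
}
```

With ratio 0.5 the root sits at the centre and each successive level halves the remaining distance to the fringe, which is where the clipping after a fixed number of levels becomes necessary.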
Navigation
through the tree is achieved by selecting any node of the current display as
the new focus. The node is moved into the centre and new nodes will appear at
the fringe of the display disc. Hovering over a node brings up a text box that
provides information about the node's subtree, such as the most frequent
keywords and the number of descendants. Colours help to distinguish between
terminal nodes (representing documents) and intermediate nodes (representing
document clusters). Printed next to each lowest-level node (those at the edge
of the display disc) is the most frequent keyword associated with the document
or document cluster. This information provides essential cues for the user when
performing directed searches through the hierarchy.
Clicking
on the document identifier on the right panel of the Dendro and Sammon Map
visualization, or double-clicking on any of the document nodes in the Radial
Map visualizations causes a new frame to open which displays the content of the
selected document. An example for a document from the Old Norse collection is
shown in Figure 7 (with the query word saman highlighted).
Figure 7: Document display using an Old Norse Document
Since Aristophanes of Byzantium, Ancient Greek has known six different
diacritics: the iota subscript, two types of breathing marks (rough and
smooth), and three types of pitch-related accents (acute, grave and
circumflex). While these facilitate the reading of Ancient Greek texts, they
complicate indexing, since we want two words to be regarded as essentially the
same if they differ only with respect to their accentuation. A robust and
commonly employed method is to ignore accents altogether and to pre-process
the texts accordingly, whatever standard has been used to encode the Greek
characters. Perseus' Ancient Greek texts are encoded using the beta-code
standard, which necessitates the removal of any of the set of six characters
{(,),=,\,/,|}
before indexing can commence. A second problem is the extraordinary wealth of
derived word forms in Ancient Greek (and to a somewhat lesser extent in Latin).
Upon declension and conjugation words can undergo changes at any part of the
stem, be it in the form of prefixes, suffixes or modifications of the stem
itself. MG’s standard stemming algorithm (i.e. Porter’s algorithm) applied to
Ancient Greek words would not achieve the desirable result of mapping derived
word forms to the same stem. In the case of Perseus document collections,
however, each document comes in two guises. The first is the beta-encoded
original document, the second is essentially the same as the first except that
each word has been replaced by its lexical form. This surrogate document
provides an ideal source for indexing and we use it both for full text indexing
and for candidate keyword indexing.
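The accent-stripping pre-process is straightforward to sketch. This is an illustration, not the Vishnu code: it removes the six beta-code diacritic characters from a beta-encoded word before indexing, so that words differing only in accentuation index identically.

```java
// Sketch: strip the six beta-code diacritics ( ) = \ / | before indexing.
public class BetaStrip {

    static String stripAccents(String beta) {
        StringBuilder sb = new StringBuilder(beta.length());
        for (char c : beta.toCharArray())
            if ("()=\\/|".indexOf(c) < 0)  // keep everything except the six diacritics
                sb.append(c);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripAccents("protre/pesqai")); // protrepesqai
    }
}
```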
The encoding used in the Perseus system is beta code, which maps each of the Greek characters and the six diacritics to one of the ASCII characters. In order to exploit the Unicode capabilities of Java, the beta code needs to be transliterated into Unicode, and a script has been written for this purpose. Unicode offers two possibilities for encoding accented characters: (i) one can encode the character and its associated accents separately (thus mirroring beta code more closely), using the two ranges 0x0370 to 0x03FF (non-accented Greek characters) and 0x0300 to 0x036F (accents); the final character is then rendered by dynamic character composition. (ii) Alternatively, one can make use of the pre-combined character range in the Greek Extended block of the Unicode character set (0x1F00 to 0x1FFF). Unfortunately, the Java rendering engine does not display glyphs of dynamically composed characters very well, a known problem that is currently being addressed by IBM. Worse, the pre-combined characters do not form part of the standard fonts included in the JVM and thus must be installed before accented glyphs can be viewed. Thus, while Java provides full Unicode support when it comes to text processing, such as searching for and sorting of words, its rendering capabilities do not currently live up to Java's ambition of complete platform independence.
We overcome this limitation by employing an encoding similar to dynamic character composition, with individual accents, and combinations thereof, being encoded as distinct strings of 0s and 1s, where a 1 indicates that the accent associated with that position has been set. Thus 01100 signifies that the preceding character carries a smooth breathing and a grave, and the beta-encoded "protre/pesqai" would result in the following pseudo-Unicode encoding: 960 961 959 964 961 949 01000 960 949 963 952 945 953. All the numbers other than those corresponding to accents represent non-accented Greek characters that are properly rendered by Java. Whenever an accent-encoding string is encountered during the rendering procedure, the corresponding accents are drawn "manually", centred on the character if it is lowercase and preceding it if uppercase.
Figure 8: Rendering of Ancient Greek
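A minimal encoder along these lines is sketched below. The letter table covers only the characters needed for the example, and the accent-to-bit assignment is an assumption made to reproduce the "/" → 01000 case; the actual Vishnu tables may differ.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: transliterate beta code into the pseudo-Unicode scheme described
// above, with accent runs encoded as bit strings following the character.
public class BetaToPseudoUnicode {

    // Partial beta-code -> Greek codepoint table (standard assignments).
    static final Map<Character, Integer> LETTERS = new HashMap<>();
    static {
        LETTERS.put('a', 945); LETTERS.put('e', 949); LETTERS.put('i', 953);
        LETTERS.put('o', 959); LETTERS.put('p', 960); LETTERS.put('q', 952);
        LETTERS.put('r', 961); LETTERS.put('s', 963); LETTERS.put('t', 964);
    }

    // Hypothetical bit order for the diacritics; '/' sits at position two so
    // that a lone '/' encodes as 01000, matching the example in the text.
    static final char[] ORDER = { '(', '/', '\\', '=', '|' };

    static String accentBits(String accents) {
        char[] bits = "00000".toCharArray();
        for (char a : accents.toCharArray())
            for (int i = 0; i < ORDER.length; i++)
                if (ORDER[i] == a) bits[i] = '1';
        return new String(bits);
    }

    static String encode(String beta) {
        StringBuilder out = new StringBuilder();
        StringBuilder run = new StringBuilder();  // pending accent run
        for (char c : beta.toCharArray()) {
            if ("()/\\=|".indexOf(c) >= 0) { run.append(c); continue; }
            if (run.length() > 0) {           // flush accents before next letter
                out.append(accentBits(run.toString())).append(' ');
                run.setLength(0);
            }
            out.append(LETTERS.get(c)).append(' ');
        }
        if (run.length() > 0) out.append(accentBits(run.toString())).append(' ');
        return out.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(encode("protre/pesqai"));
    }
}
```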
As described above, each Greek
and Latin text comes in two guises. The surrogate version consists of the
lemmatized forms of the constituent words. This is used for indexing and can
be used in the visualization to give the user an idea of the lexical form of
the word. In the small viewing application, moving over a Greek word displays
the lexical form underneath it as shown below:
Figure 9: Displaying the lexical form of words as the mouse passes over the text
In the example shown, the original word is a verb in the present middle
infinitive and is lemmatized to the first person singular of the active voice.
All the Old Norse texts are
available in the Unicode format and can be indexed and rendered without the
need for prior transliteration.
The code for both the stand-alone application and the web application can be
downloaded at
http://faya.doc.ic.ac.uk/chlt/vishnu
To install follow the steps below:
gunzip vishnu.tar.gz
cd VIS
tar -xf vis.tar
source setup.sh
install.sh
The setup.sh script sets the VIS_HOME environment variable, which points to the root directory of the visualization. This will be used by the full-text search engines to locate the indices, and by the clustering program (procResults.c compiled into the shared library libCluster.so for JNI access) to locate the candidate keyword indices. It also sets the LD_LIBRARY_PATH environment variable which contains the path to the shared libraries.
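A small helper in the spirit of this arrangement is sketched below. It is hypothetical, not part of the distribution: it shows how server-side code can resolve a collection's index location from the VIS_HOME root set by setup.sh.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Sketch (hypothetical helper): resolve collection index paths under VIS_HOME.
public class VisHome {

    // Resolve <visHome>/collections/<collection>/index from an explicit root,
    // keeping the path logic testable without touching the real environment.
    static Path indexDir(String visHome, String collection) {
        return Paths.get(visHome, "collections", collection, "index");
    }

    public static void main(String[] args) {
        // Fall back to a placeholder root when VIS_HOME is not set.
        String home = System.getenv().getOrDefault("VIS_HOME", "/tmp/VIS");
        System.out.println(indexDir(home, "Norse"));
    }
}
```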
The following subdirectories exist, with the associated functionality:
Lib          contains jars for using servlet technology and the Lucene
             full-text indexing and search functionality
MG           contains the MG distribution
collections  the top directory for the various collections accessed by the
             system; each collection sits in its own subdirectory and contains
             indices, original and derived (surrogate) documents
C            the C source for the clustering and candidate keyword
             extraction code
docs         documentation files
bin          executables for candidate keyword indexing etc.
java         the bulk of the code (visualizations, GUI for collection
             building, Lucene and MG wrappers for indexing, document parsers
             etc.), server-side code for the stand-alone application
Notes:
To employ MG as a search engine, some of MG's native programs need to be
replaced by our modified versions. The files are command.c (which contains an
additional directive for retrieving the full content of a document) and
mgquery.c. They need to be added to the mg-1.2.1/src/txt directory before
being copied to MG's bin directory. The MG configuration file, .mgrc, and the
shell script mgbuild have been modified slightly and need to be replaced in
the bin directory.
VisServlet is the principal
servlet through which the applet accesses server-side services. Compilation
requires servlet.jar to be in the classpath (this jar is found in tomcat's lib
directory). When the system is installed as a web application, the host name
and port number are determined at runtime.
For collection building, the Java program CollectionConstructor in
$VIS_HOME/java/classes/Util can be used. It takes care of all the necessary
steps involved in building a new collection. These are (i) making the import
and index directories, (ii) importing files, (iii) indexing files using either
Lucene or MG for full-text search, (iv) extraction of candidate keywords for
online clustering, and (v) writing of the collection-specific config.xml. All
necessary parameters are supplied on the command line.
The Java wrappers for Lucene and MG are in $VIS_HOME/java/classes/Indexer and
take four arguments: collection name, document source, index destination, and
view type (diplomatic, regular, stemmed, lemmatized etc.). The collection name
is needed as mgbuild prepends each index file with the collection name and
places all files in a directory with the collection's name. Note that the
Lucene wrapper requires lucene.jar in the classpath for compilation and
running.
An example of indexing the documents in the directory /vol/coll/norse/import
using Lucene or MG, where the results are to be stored in
/vol/elweb/VIS/collections/nordic/index with the view "lem" and the collection
name "Old Norse", is shown below:
java -cp $CLASSPATH Indexer/LuceneWrapper -i /vol/coll/norse/import -c "Old Norse" -o /vol/elweb/VIS/collections/nordic/index -s lem
java Indexer/MGWrapper -i /vol/coll/norse/import/ -c "Old Norse" -o /vol/elweb/VIS/collections/nordic/index -s lem
MG indexing is achieved by
running MG's mgbuild script. mgbuild has been modified to allow an additional
argument (-i) which specifies the directory in which documents are found. This
is passed on to mg_get for document acquisition. Using this parameter, we
circumvent the need for environment variables. mg_get takes the document
location as its single argument. mgbuild decides where to place the indices based
on the -d argument.
$VIS_HOME/java/classes
contains a Java wrapper for the candidate keyword indexing program ck_index
found in $VIS_HOME/bin. The wrapper takes two arguments which are passed on to
ck_index, the first being the location of the data, the second the location of
the indices to be stored.
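A wrapper in this spirit can be sketched as follows. This is illustrative, not the shipped class: it simply forwards the two arguments, data location and index location, to the ck_index executable; all paths shown are placeholders.

```java
import java.util.Arrays;
import java.util.List;

// Sketch of a wrapper that forwards two arguments to the ck_index program.
public class CkIndexWrapper {

    // Build the command line; kept separate from execution so it can be tested.
    static List<String> command(String binDir, String dataDir, String indexDir) {
        return Arrays.asList(binDir + "/ck_index", dataDir, indexDir);
    }

    public static void main(String[] args) throws Exception {
        List<String> cmd = command("/opt/VIS/bin", "/data/norse", "/data/norse/ck");
        System.out.println(cmd);
        // Actual invocation would hand the list to a ProcessBuilder:
        // new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```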
The makefiles for the libraries may need to be customized to find jni.h and jni_md.h (normally in $JAVA_HOME/include and $JAVA_HOME/include/linux).