The MultiText Project

The MultiText Project is concerned with developing techniques for the indexing
and retrieval of very large electronic collections of text.  By "very large"
we are not referring merely to collections such as the *Complete Works of
William Shakespeare* or the *Encyclopedia Britannica* that might fit on one or
more CD-ROMs and be purchased by the owner of a personal computer.
Rather, we are concerned with techniques for collections many times
larger --- all issues of a large newspaper for several decades, all journals
in a subject area, or, ultimately, a significant fraction of all text available
electronically.

In developing these techniques we are considering the many unique requirements
of very large text collections:

*Multiple Users*
It is not possible for each user to have a copy of the text and indexing
information on his or her own personal computer.  Our techniques allow many
thousands of users to simultaneously query a text collection across a computer
network.  Incoming requests are scheduled to minimize the impact users have on
one another.

*Multiple Server Machines*
Several computers must work in cooperation to provide storage and
indexing for collections of this size.  It is not feasible to store all
information on a single computer or even at a single site.  Our techniques
allow effective and efficient communication of information between users'
machines and the various machines indexing and storing the text.

*Continuous Availability*
The text collection must be updated, reorganized and extended while remaining
available to users.  The individual computers storing and indexing the text must
be maintained and repaired with only a minimal reduction in performance.  An
unexpected failure of one of the individual computers must have no effect on
availability and only a minimal effect on performance.

*Multiple Query Languages*
A variety of query languages and graphical user interfaces must be
simultaneously supported, accommodating variations in users' tastes and
abilities.

*Multiple Text Formats*
Documents in different formats must be stored in the same collection.  Despite
differences in format, users can still formulate queries that refer to document
structure --- title or author, for example.
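
As a rough illustration of the last requirement, the sketch below shows one way
a structural query (on title or author) could run over documents stored in
different formats.  It is not a description of the MultiText implementation.
The two formats ("tagged" and "header"), the function names, and the sample
documents are all hypothetical and were invented for this example.

```python
import re

def parse_tagged(text):
    """Extract fields from a hypothetical tagged format: <title>...</title>."""
    return {m.group(1).lower(): m.group(2).strip()
            for m in re.finditer(r"<(\w+)>(.*?)</\1>", text, re.S)}

def parse_header(text):
    """Extract fields from a hypothetical 'Name: value' header format."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            break  # a blank line ends the header block
        name, sep, value = line.partition(":")
        if sep:
            fields[name.strip().lower()] = value.strip()
    return fields

# One parser per format; each maps its own markup to common field names.
PARSERS = {"tagged": parse_tagged, "header": parse_header}

def search(collection, field, term):
    """Return sorted ids of documents whose structural field contains term."""
    hits = []
    for doc_id, (fmt, text) in collection.items():
        value = PARSERS[fmt](text).get(field, "")
        if term.lower() in value.lower():
            hits.append(doc_id)
    return sorted(hits)

# Made-up sample documents, two formats in one collection.
collection = {
    "d1": ("tagged",
           "<title>Hamlet</title><author>William Shakespeare</author>"),
    "d2": ("header",
           "Title: Macbeth\nAuthor: William Shakespeare\n\nBody text."),
    "d3": ("header",
           "Title: Shakespeare Criticism\nAuthor: A. Critic\n\nBody text."),
}

by_author = search(collection, "author", "shakespeare")
by_title = search(collection, "title", "shakespeare")
```

The same query term matches different documents depending on the structural
field named, regardless of the format in which each document is stored.  This
sketch re-parses every document per query for brevity; a system at the scale
described here would rely on an index instead.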

Project Principals:
  Gordon Cormack (cormack@plg.uwaterloo.ca)
  Forbes Burkowski (fjburkow@plg.uwaterloo.ca)

Project Staff:
  Charlie Clarke (claclark@plg.uwaterloo.ca)
  Rob Good (rcgood@kiwi.uwaterloo.ca)