[18] G. Salton. Automatic Text Processing. Addison-Wesley, USA, 1989.
[19] G. Salton and C. Buckley. On the automatic generation of content links in
hypertext. Technical Report 89-993, Cornell University, April 1989.
[20] F. Sarre and U. Güntzer. Automatic transformation of linear text into hypertext.
In International Symposium on Database Systems for Advanced Applications,
pages 498-506, Tokyo, Japan, 1991.
[21] J. A. Thom, A. J. Kent, and R. Sacks-Davis. TQL: Tutorial and user manual.
Technical Report 18, Key Centre for Knowledge Based Systems, Departments of
Computer Science, RMIT and the University of Melbourne, 1990.
[22] J. A. Thom, A. J. Kent, and R. Sacks-Davis. TQL: a nested relational query
language. Australian Computer Journal, 23(2):53-65, May 1991.
[23] J. Zobel, R. Wilkinson, E. Mackie, J. Thom, R. Sacks-Davis, A. Kent, and
M. Fuller. An architecture for hyperbase systems. In 1st Australian Multi-Media
Communications and Applications and Technology Workshop, Sydney, Australia,
July 1-2 1991. Also available as Technical Report 42, Key Centre for Knowledge
Based Systems, RMIT and the University of Melbourne.
[3] B. Campbell and J.M. Goodman. HAM: a general purpose hypertext abstract
machine. Communications of the ACM, 31(7):856-861, 1988.
[4] K. Chattrasophon. Enhanced access mechanisms for document retrieval system.
Master's thesis, Department of Computer Science, Royal Melbourne Institute of
Technology, Melbourne, 1987.
[5] J. Conklin. Hypertext: An introduction and survey. IEEE Computer, 20(9):17-41,
September 1987.
[6] Software Exoterica Corporation. XGML Translator 1.0, 1990.
[7] C.J. Date. An Introduction to Database Systems, 4th edition. Addison-Wesley
Publishers, USA, 1986.
[8] M. E. Frisse. Searching for information in a hypertext medical handbook. In
Hypertext '87 Papers, pages 57-66, Chapel Hill, North Carolina, 1987.
[9] M. Fuller, A. Kent, R. Sacks-Davis, J. A. Thom, R. Wilkinson, and J. Zobel.
Querying in a large hyperbase. In Second International Conference on Database
and Expert Systems Applications, Berlin, Germany, August 21-23 1991.
[10] G. W. Furnas. Generalized fisheye views. In CHI'86 Proceedings, pages 16-23,
April 1986.
[11] P. K. Garg. Abstraction mechanisms in hypertext. Communications of the ACM,
31(7):862-870, July 1988.
[12] U. Hahn and U. Reimer. Automatic generation of hypertext knowledge bases.
In Proceedings ACM Conference on Office Information Systems, pages 182-188,
Palo Alto, California, 1988.
[13] E. Mackie and J. Zobel. Retrieval of tree-structured data from disc. In Proceed-
ings of the Third Australian Database Conference, Melbourne, Australia, 1992.
[14] W. Merkl, S. Vieweg, and A. Karapetjan. KELP: A hypertext oriented user-
interface for an intelligent legal fulltext information retrieval system. In International
Conference on Database and Expert System Applications - DEXA 90,
pages 399-404, Vienna, Austria, 1990.
[15] C. D. Paice. Constructing literature abstracts by computer: Techniques and
prospects. Information Processing and Management, 26(1):171-186, 1990.
[16] D. R. Raymond. lector - an interactive formatter for tagged text. Technical
Report OED-90-02, Centre for the New Oxford Dictionary and Text Research,
University of Waterloo, 1990.
[17] B. N. Rossiter, T. J. Sillitoe, and M. A. Heather. Database support for very
large hypertexts. Electronic Publishing, 3(3):141-154, August 1990.
The second database was a technical paper written in SGML. This document
had a deep hierarchical structure. This database consisted of only 3805 nodes and
292 links, but demonstrated the ability to store complex structures, and to display
them in various ways, including the implementation of a fish-eye view of the table of
contents.
The next database will provide an example of a large hyperbase with a fairly complex
structure. Again, such a hyperbase would be prohibitively difficult to construct
manually. It consists of a large portion of the Commonwealth of Australia's Acts of
Parliament. It is estimated that this database will consist of 1 million nodes and
10 million links.
The cost of installing a new database is worth mentioning. If the text is not
already in SGML format, a grammar must be devised to mark up the text in the
appropriate way. Next, a grammar needs to be written to allow node creation based
on the text structure. The rest of the system is generic, dependent entirely upon the
SGML markup, so that enormous savings accrue in the creation of large hypertext
databases.
8 Conclusions
We have described a system that provides an automatic method for generating richly
structured document hyperbases. These hyperbases support both browsing and ranked
natural language querying. This has occurred through the successful integration of
SGML, hypertext browsing capabilities, sophisticated information retrieval querying
techniques, and a nested relational database system. Key benefits are that this system
requires minimal manual assistance, and is generically applicable. The use of SGML
is a central factor in this. It offers a suitable representation for document structure
that can be used in the conversion of source text into hypertext. By preserving the
SGML markup within nodes, re-assembly of document fragments for presentation is
easy. The nested relational model proved to be a suitable database engine for such a
system since it provides both the capabilities and speed required.
Three immediate areas of challenge are hypertext node generation, high-level
querying, and query ranking. Node-node and node-document relationships within
source text need to be recognised and converted to explicit links. It remains to
incorporate link information into querying, so that use can be made of inter- and
intra-document relationships and database and document structure. Also, devising
ranking algorithms that take into account the structure and fragmentation of documents
would be of immediate benefit.
References
[1] ISO 8613. Information Processing - Text and Office systems - Office Document
Architecture (ODA) and Interchange Format, 1989.
[2] ISO 8879. Information Processing - Text and Office systems - Standard Generalized
Markup Language (SGML), 1986.
A user can move from one node to another either by selecting a new point in the table
of contents, by following a link, or by issuing a query. In the last case, the controlling
unit calls a text query processor which creates a window into which the query can
be entered. The query processor then identifies and ranks relevant documents. The
controlling unit can then present that list of documents, along with some summary
information about each potential destination, via another lector process. In the other
cases, having determined the appropriate destination, the relevant nodes are retrieved
from the Titan+ database, and reformed into tagged text. This can then be passed
to a lector process for display. (See Figure 2.)
7.2 Sample Hyperbases
Two databases have been created using the tools described earlier. These demonstrate
the ability to import document structure into a large hyperbase. One database consists
of 611 documents, albeit with little structure, and the other is a single technical
paper with a complex structure.
Figure 3: The User View
The first of the two is sufficiently large that it would have been prohibitively
difficult to generate manually. It consisted of Victorian Government press releases,
where each release had a title and a body. As well as structural links, a set of
links between references to significant entities in Victorian politics was created. This
resulted in a database with 8,657 nodes and 20,598 links. Figure 3 shows what the
user sees when browsing this hyperbase.
[Diagram, top to bottom: menu window, table of contents window, text window, and query
window; three lector processes and a query processor; node manager; Titan+ hypertext
database.]
Figure 2: Hyperbase Implementation
that displays text in windows and allows interactive behaviour by reporting whenever
and where in a window a mouse button click occurs. Lector has been used as a text
previewer, database browser, code pretty printer, menu utility and iconic interface.
We use it to display nodes, to display a table of contents, to present menus, and to
allow links and buttons to be activated.
The particular advantage of lector, which is derived from its SGML-based nature,
is that the same data may be presented in substantially different ways. It uses simple
specification files to determine how it is to interpret the SGML tags it finds, and can
switch between specification files as instructed. One specification file might instruct
the lector process to present the text in two columns, with references in bold,
section headings omitted, and links highlighted in reverse video. Another might cause
only titles to be displayed, creating a table of contents. A third might instruct
lector to output raw text, tags and all. This gives us significant scope in providing
an appropriate user interface to the hypertext database.
The components of each layer are then related in the following way. A controlling
unit creates several lector processes to display nodes, menus, and a table of contents.
that area. This type of aid can be extremely beneficial to users browsing a hyperbase,
and is appropriate for displaying tables of contents.
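The fish-eye idea can be illustrated for a table of contents with a minimal sketch. It assumes each entry is identified by its section path, for example (2, 1) for Section 2.1; the function name `fisheye_toc` and the parameter `coarse_depth` are hypothetical and are not part of lector or of the prototype. Entries near the focus are shown in full, while elsewhere only the top levels appear.

```python
def fisheye_toc(toc, focus, coarse_depth=1):
    """Return indented titles for a fish-eye view of a table of contents.

    toc:   list of (path, title) pairs in document order, where path is a
           tuple of section numbers such as (2, 1) for Section 2.1.
    focus: path of the entry the user is currently looking at.
    An entry is shown if it is shallow enough to count as coarse structure,
    or if its parent lies on the path from the root to the focus (this covers
    the focus itself, its siblings, its children, and its ancestors' siblings).
    """
    shown = []
    for path, title in toc:
        parent = path[:-1]
        is_coarse = len(path) <= coarse_depth
        is_near_focus = focus[:len(parent)] == parent
        if is_coarse or is_near_focus:
            shown.append("  " * (len(path) - 1) + title)
    return shown


if __name__ == "__main__":
    toc = [
        ((1,), "Intro"), ((1, 1), "Motivation"),
        ((2,), "Storage"), ((2, 1), "Schema"), ((2, 1, 1), "Links"),
        ((2, 2), "Queries"),
        ((3,), "Display"), ((3, 1), "Lector"),
    ]
    print("\n".join(fisheye_toc(toc, focus=(2, 1))))
```

Raising `coarse_depth` reveals more of the distant structure; moving `focus` shifts the region of fine detail.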
Movement within the database can be done by following links within a document,
selecting different points within the table of contents, or by querying. Querying involves
the creation of a query interface into which a user inputs his or her query. After
the appropriate nodes are identified, they are collated into a query node containing
the query text, summary information about the documents satisfying the query, and
links to the potential destinations.
7 A Prototype Hyperbase
7.1 System Overview
Before text enters our hyperbase, it passes through a number of processes, as previ-
ously described.
Raw, un-tagged text enters the system via XGML [6], an SGML parser/validator.
By applying some simple rules, unmarked ASCII text can be converted to SGML
text. For example, a particularly simple rule might be that a newline followed by
whitespace represents a new paragraph.
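The paragraph rule just mentioned can be sketched as follows. This is an illustration only, not the XGML rule language: the function name `tag_paragraphs` and the `<p>` tag are assumptions made for the example.

```python
import re
from html import escape


def tag_paragraphs(raw: str) -> str:
    """Mark up plain text, treating a newline followed by whitespace
    (an indented line or a blank line) as the start of a new paragraph."""
    # Split at every newline that is immediately followed by whitespace.
    chunks = re.split(r"\n\s+", raw.strip())
    # Collapse the remaining line breaks inside each paragraph.
    paragraphs = [" ".join(chunk.split()) for chunk in chunks if chunk.strip()]
    # Escape markup characters so the text cannot be mistaken for tags.
    return "\n".join(f"<p>{escape(p, quote=False)}</p>" for p in paragraphs)


if __name__ == "__main__":
    sample = "First line\ncontinues.\n  Second para\nmore."
    print(tag_paragraphs(sample))
```

A real conversion would need further rules of the same kind, for titles and headings for instance, but each can be expressed in the same way.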
The next phase involves passing SGML tagged text through XGML. In this pass,
XGML is used to split up text into nodes and links that can be directly inserted into
a hyperbase. XGML also supports grammar-based treatment
of the SGML parse tree. The actions that are taken are dependent on the grammar
described in the document DTD. Current context and the status of other sections of
the parse tree can be taken into account, allowing very complex rules to be used. A
simple example of a rule would be to create a node for every paragraph, or to insert
the SGML generic identifiers as node types.
Prior to insertion into the hyperbase, term frequency information is collected and
further link generation occurs. At present, this additional link generation is of the
simplistic nature described in Section 3.
Finally, the data can be inserted into the hyperbase.
The hyperbase architecture on which our prototype is based is the three-layer
model described by Zobel et al. [23]. For the database layer, we use Titan+, as discussed
earlier.
Of the layers present in the model, it is the middle, or node layer, that is most
under-developed in our prototype. At present, its tasks are to coordinate activity
between the database layer and the display layer, determine appropriate database
queries, reconstruct fragmented text, and monitor browse and query window requests.
It currently lacks both the node query language, and the formal node organisation
description present in the model. It is expected that the DTDs of documents present
in the hyperbase would be used to assist in the creation of a node organisation de-
scription for that hyperbase.
For the top, or interface, layer, we are currently using an SGML based display tool,
lector [16]. Lector is a display utility that has a simple model of text and is designed
to display text that has been marked up with SGML tags. It is an X.11 application
• the number of documents in which a given term appears (the database frequency), and
• the number of times a term appears within each document (the term frequency)
be known [18,4]. The term frequencies are determined at the time of creation of the
database, and are stored with the document in a nested table of term, frequency
pairs. The presence of this information separate from the body of the data facilitates
indexing. The database frequency information is stored in a separate table in the
database, and is derived from the within-document frequencies after the database is
created.
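The way these two kinds of frequency feed a ranked query can be sketched as below. The weighting used here, term frequency multiplied by log(1 + N/df), is one common choice and is an assumption on our part; it is not necessarily the formula used with Titan+. The function names are likewise hypothetical.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Within-document frequencies: the nested (term, frequency) pairs."""
    return Counter(text.lower().split())


def database_frequencies(doc_tfs):
    """Number of documents containing each term, derived from the
    within-document frequencies after all documents are loaded."""
    df = Counter()
    for tf in doc_tfs.values():
        df.update(tf.keys())
    return df


def cosine_rank(query, doc_tfs, df):
    """Rank documents against a query using the cosine measure."""
    n = len(doc_tfs)

    def weights(tf):
        # Assumed weighting: tf * log(1 + N / df); unknown terms are ignored.
        return {t: f * math.log(1 + n / df[t]) for t, f in tf.items() if df.get(t)}

    def norm(w):
        return math.sqrt(sum(x * x for x in w.values()))

    q = weights(term_frequencies(query))
    scores = {}
    for doc_id, tf in doc_tfs.items():
        d = weights(tf)
        dot = sum(w * d.get(t, 0.0) for t, w in q.items())
        denom = norm(q) * norm(d)
        scores[doc_id] = dot / denom if denom else 0.0
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    docs = {1: "parallel computer system", 2: "database systems design"}
    tfs = {doc_id: term_frequencies(text) for doc_id, text in docs.items()}
    print(cosine_rank("parallel computer", tfs, database_frequencies(tfs)))
```

Keeping the term frequencies with each document and the database frequencies in a separate table mirrors the storage arrangement described above.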
This implementation of ranking is not problem-free. Ranking performance drops
when documents are fragmented into relatively small sections. Two major options are
apparent. One is that outlined by Frisse [8]. His suggestion was that node ranks could
be calculated recursively. The ranked value of a node was determined from its own
similarity to the query, in combination with the combined weight of its children's
similarities divided by the number of children. An alternative to this is to create an
abstract description summarising document topic at the time of node generation. A
bonus of carrying out this `abstracting' at the same time as the input text is parsed is
that the full document structure is available. This structure may well provide useful
information for any abstracting operation [15].
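One reading of the recursive scheme can be sketched as follows: a node's rank is its own similarity to the query plus the mean rank of its children. The exact way of combining the two quantities is an assumption made for this example, as are the function name `node_rank` and the dictionary-based tree representation.

```python
def node_rank(node_id, similarity, children):
    """Rank a node from its own similarity and its children's ranks.

    similarity: dict mapping node id -> similarity of that node to the query.
    children:   dict mapping node id -> list of child node ids.
    Assumed combination: own similarity + (sum of child ranks / child count).
    """
    own = similarity.get(node_id, 0.0)
    kids = children.get(node_id, [])
    if not kids:
        return own  # a leaf is ranked on its own text only
    child_total = sum(node_rank(k, similarity, children) for k in kids)
    return own + child_total / len(kids)


if __name__ == "__main__":
    children = {1: [2, 3]}
    similarity = {1: 0.2, 2: 0.4, 3: 0.8}
    print(node_rank(1, similarity, children))
```

Under this scheme a section whose own text matches weakly can still rank well if its fragments match strongly, which is the effect wanted when documents are split into small nodes.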
The availability of a document abstract could also be useful in other ways. As
previously discussed, such an abstract might be used during the link generation phase
to help determine appropriate inter-document and inter-node links. It could also be
informative to a user browsing or querying the database, who could use it, say, to
help determine which nodes of those which satisfy a query are actually of interest.
Alternatively, it might be useful if there is a request for a summary of a node, or
document, or group of nodes or documents.
6 Hypertext Display
Because SGML DTDs do not actually say anything about how to interpret or present
the text that they define, it is possible to take the tagged text and display it in whatever
way is most appropriate given the constraints of window size, user preference,
and data content and structure.
For example, a large document could be presented both as full text in a scrollable
window, and as a table of contents in another window. We can dynamically embed
in the text tags and data representing available hypertext links, that then can be
displayed appropriately. The level of information available and the way in which it is
presented can be varied, perhaps at a user's request, merely by treating those tags in
a different way. For the table of contents, the SGML would be interpreted to display only
the various section titles, perhaps indented according to nesting. In fact, it would
be possible to choose to only display certain instances of section headings, with a
particular sub-tree, and to vary which instances and what depth.
This allows us to easily present a fish-eye view [10] of the document. Such a view
provides fine detail within a small area, but only coarse structure further away from
Typically, a user viewing a section of the hyperbase will see a link that he or she
wishes to traverse. The user then activates that link, presumably by selecting it with
a mouse. Let us assume that the node id of the destination of the link is `1234,' and
that the node is the start of a document section.
This requires the reconstruction of an entire subtree. A simple approach, a
recursive traversal of the subtree, would result in many database queries for what
is a common operation. A more efficient approach is to retrieve the whole subtree in
one operation. In our example, this is achieved by a single TQL query that retrieves
all nodes that have an ancestor link to node 1234.
This tree can then be recreated by inspecting each retrieved node's parent id and
sibling order. From the tree, the SGML tagged text can be re-created by making an
inorder traversal. Any embedded links associated with the retrieved nodes can be
displayed by inserting appropriate tags and data into the text, which can then be
presented via the user interface. For further discussion of tree retrieval, see Mackie
and Zobel [13].
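The reconstruction step can be sketched as below, assuming the single query has already returned the subtree as a flat collection of rows. The row layout follows the fields of the basic schema; the function name `rebuild_text` is hypothetical, and the traversal shown is a simple depth-first walk that emits each node's own data before its children, which is an assumption about how text is split between a node and its descendants.

```python
from collections import defaultdict


def rebuild_text(rows, root_id):
    """Re-create tagged text from flat rows retrieved in one query.

    Each row is a dict with node_id, node_type, parent_id, sib_order, data.
    """
    nodes = {}
    by_parent = defaultdict(list)
    for row in rows:
        nodes[row["node_id"]] = row
        by_parent[row["parent_id"]].append(row)
    # Restore the original order of siblings under each parent.
    for kids in by_parent.values():
        kids.sort(key=lambda r: r["sib_order"])

    def emit(node):
        tag = node["node_type"]
        inner = node["data"] + "".join(emit(k) for k in by_parent[node["node_id"]])
        return f"<{tag}>{inner}</{tag}>"

    return emit(nodes[root_id])


if __name__ == "__main__":
    rows = [
        {"node_id": 1234, "node_type": "section", "parent_id": 1, "sib_order": 2, "data": ""},
        {"node_id": 1236, "node_type": "para", "parent_id": 1234, "sib_order": 2, "data": "Second."},
        {"node_id": 1235, "node_type": "title", "parent_id": 1234, "sib_order": 1, "data": "Intro"},
    ]
    print(rebuild_text(rows, 1234))
```

The rows may arrive in any order; only the parent and sibling-order fields are needed to rebuild the tree in memory.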
5.2 Querying
The second method available for accessing the database is the issue of sophisticated
high-level queries. Queries can involve specification of, or constraints on, content,
structure, and links. The first of these is found in normal information retrieval systems.
Examples of these types include:
content: Find nodes containing the words (`computer' or `system') and
`parallel'
structure: Find documents whose titles contain `database systems' and
that have diagrams
links: Find documents that cite this document and that are about
conceptual clustering
In order to support these complex queries, a natural language understanding engine
could be used to identify the references to document structure and link type.
Alternatively, a form-based or graphical query interface might be used.
There are problems involved in querying aggregate structures such as `documents'
and querying based on relationship. A document may be broken into fragments that
together satisfy a query but not separately. As well, efficient query formulation strategies
need to be devised to handle multiple-stage queries. The task of handling these
queries would be eased by the presence of a node query language that better reflects
the typical high-level queries that are likely to be encountered. A node organisation
description that provides a formal view of the document and hyperbase structures
would also be helpful, both as a guide to the formulation of user queries, and in
the task of mapping those queries into low-level database queries. A description of
a node query language can be found in Fuller et al. [9], and the need for such a node
organisation description in a formal hyperbase architecture is discussed by Zobel et
al. [23].
In order to provide ranked querying, certain information is required. The cosine
measure and its accompanying term-weighting formulae require that
HyperText[
    node_id INTEGER NOT NULL,
    node_type TEXT,
    parent_id INTEGER,
    sib_order INTEGER,
    links[
        dest_id INTEGER,
        link_type TEXT, ...
    ],
    data TEXT, ...
]
Figure 1: Basic Schema
makes it possible to retrieve an entire subtree of the parse tree with one query, and
greatly simplifies the task of displaying documents. The type of each link is noted in
the field link_type and can be useful in many hypertext operations. This includes the
placement of constraints on hypertext queries, the provision of additional information
as to how a link should be displayed, and helping to determine any action to be taken
when a link is traversed.
Two alternatives existed for management of links. These were to store all link
information in a separate link table (which would be the only choice with a relational
database), or to store all link information in the nodes in which the links are anchored.
For efficient database queries, especially those relating to movement around
a document, we chose to implement links as a nested table in the node table.
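The effect of nesting links inside nodes can be illustrated in memory with the sketch below. The classes mirror the fields of the basic schema; the link type name "ancestor" and the function `subtree` are assumptions for the example, and the list comprehension merely stands in for the single database query that the nested design allows.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Link:
    dest_id: int
    link_type: str


@dataclass
class Node:
    node_id: int
    node_type: str
    parent_id: Optional[int]
    sib_order: int
    links: List[Link] = field(default_factory=list)  # nested table of links
    data: str = ""


def subtree(nodes, root_id):
    """Select the root and every node holding an ancestor link to it.

    Because each node stores links to all of its ancestors, one pass
    (one query) is enough; no recursive descent is required.
    """
    return [
        n for n in nodes
        if n.node_id == root_id
        or any(l.dest_id == root_id and l.link_type == "ancestor" for l in n.links)
    ]


if __name__ == "__main__":
    nodes = [
        Node(1, "document", None, 1),
        Node(1234, "section", 1, 1, [Link(1, "ancestor")]),
        Node(1235, "para", 1234, 1, [Link(1234, "ancestor"), Link(1, "ancestor")], "Text."),
        Node(2000, "section", 1, 2, [Link(1, "ancestor")]),
    ]
    print([n.node_id for n in subtree(nodes, 1234)])
```

With a separate link table the same selection would need a join between nodes and links; here the links travel with the node that anchors them.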
5 Browsing and Querying
When the hyperbase is first accessed, the nodes at the top of the database are displayed.
From that point, there are two ways of exploring the hyperbase. One of these
is by making high-level queries, and the other is by browsing.
5.1 Browsing
Browsing can be done by selecting a table of contents entry and moving to the subtree
rooted at the individual node having that title, by scrolling through the current
document, or by following a link.
In each case the effect is to select the nodes from the database that allow the
creation of a partial document tree. A portion of the document is reconstructed and
passed to the display manager. At the same time, the table of contents is updated in
just the same way, possibly using the same data (discussed in Section 6).
nodes together. This linking is via circles of nodes, or `daisy chaining.' A set of words
in the vocabulary of the database is selected as being interesting. This can be done
using term weighting functions based on term frequency information [18].
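Daisy chaining can be sketched as follows, assuming the interesting terms have already been chosen and that, for each term, the ids of the nodes containing it are known. The function name `daisy_chain` and the use of the term itself as the link label are assumptions made for the example.

```python
def daisy_chain(term_to_nodes):
    """Link the nodes that share an interesting term into a circle.

    term_to_nodes: dict mapping term -> iterable of node ids containing it.
    Returns (source id, destination id, term) triples; following the links
    for one term from any node eventually returns to that node.
    """
    links = []
    for term, node_ids in term_to_nodes.items():
        ring = sorted(set(node_ids))
        if len(ring) < 2:
            continue  # a single node cannot form a circle
        for i, src in enumerate(ring):
            dest = ring[(i + 1) % len(ring)]  # the last node links back to the first
            links.append((src, dest, term))
    return links


if __name__ == "__main__":
    print(daisy_chain({"premier": [7, 3, 5], "budget": [9]}))
```

A circle of n nodes needs only n links, whereas linking every pair of nodes that share a term would need n(n-1).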
4 Hypertext Storage
Most hypertext research has involved purpose-built hypertext software without an
underlying theory, and has focused on the mechanics of hypertext and principles of
human-computer interaction [5]. Moreover, most hypertext systems have not been
designed to cater for large collections of data. Although some work has been directed
towards a theory of hypertext (for example, the abstract algebra of hypertext operations
described by Garg [11], and the abstract hypertext engine described by Campbell
and Goodman [3]), hypertext has not generally been formally associated with data
storage and retrieval techniques. In contrast, database systems are very much concerned
with storage and retrieval of data. They provide techniques for organising data,
managing data, retrieving data, and such facilities as transaction and concurrency
management. The data is generally organised with regard to a formal model. The
existence of this model permits formal definition of query and update languages [7].
The importance of incorporating such database technology in hypertext systems is
discussed by Rossiter et al. [17].
There are several requirements that a database system must satisfy in order to
be usable as the database component of a sophisticated hyperbase system. Firstly,
and most importantly, the database system must have support for retrieval of text
based on its content. Secondly, the actual database system either should have embedded
ranking techniques, or should be able to easily provide the information needed.
Thirdly, because of the nature of text, the database system should be able to store
variable-sized entities. These questions are addressed by Zobel et al. [23].
We have chosen to use a nested relational database system, Titan+ [21], that has
been designed to support text-based applications. The nested relational model directly
supports the hierarchical data structures required to represent documents, and
a number of text operators and access methods have been included in Titan+ to support
retrieval from very large text databases. The model allows links and nodes to
be stored together in the one tuple. A query and update language, TQL [22], provides a
command language interface to Titan+. TQL is an extension of the standard relational
language SQL, designed to support nested relations.
We need to store the nodes, their types, and the location of each node with respect to
other nodes. Figure 1 presents a TQL schema containing the information felt to be
necessary to maintain a hyperbase.
The node type may reflect its position in the tree structure: whether it is the
top of a document, or an important document sub-tree, such as a section or chapter,
or a paragraph (which could be a leaf node). Alternatively, the type may indicate
that it is a user's annotation node, or a node formed as the result of a query. The
location of the node within the tree is given by parent_id and sib_order. The nested
table of links contains links to all related nodes and all ancestors of the node. This