Evaluating Automatically Generated Hypertext Versions of Scholarly Articles: Extra Text About IR Methods

James Blustein

NOTE: This document is not part of the submission. It is provided only for the convenience of the referees. This document (or a version of it) can be incorporated into the submission if desired.

IR systems select relevant documents from document collections in response to users' requests; in IR, a user's request is called a query. The systems determine which documents are relevant by comparing compact representations of documents and queries, and retrieve the documents whose representations are most similar to a given query. The representations are made to store only the essential information in the documents and queries.
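To illustrate the comparison concretely, here is a minimal Python sketch of similarity ranking. It assumes cosine similarity between pre-computed weight vectors; neither system's actual formula is reproduced here.

    import math

    def cosine(u, v):
        # Cosine of the angle between two equal-length weight vectors.
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    def retrieve(query_vec, doc_vecs, n=10):
        # Return the ids of the n documents most similar to the query.
        ranked = sorted(doc_vecs.items(), key=lambda kv: cosine(query_vec, kv[1]),
                        reverse=True)
        return [doc_id for doc_id, vec in ranked[:n]]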

Terms and Weights

Significant words in a document are called terms. In both IR systems I use, documents are represented by term vectors. A term vector consists of numerical weights for terms in the document collection: the more significant a term, the greater its weight. A term's significance is determined by the part it plays in the collection and in individual documents. The term vector for a particular document may include entries for terms that are not present in that document, because those terms occur elsewhere in the collection.
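The two systems weight terms differently; the sketch below uses the common tf-idf scheme purely to illustrate how within-document frequency and collection-wide rarity can combine into a weight.

    import math
    from collections import Counter

    def term_vectors(docs):
        # docs: a list of token lists, one per document. Returns the
        # collection vocabulary and one weight vector per document.
        vocab = sorted({t for d in docs for t in d})
        df = Counter(t for d in docs for t in set(d))   # document frequency
        n = len(docs)
        vectors = []
        for d in docs:
            tf = Counter(d)
            # Weight grows with in-document frequency (tf) and with
            # rarity across the collection (idf); absent terms get 0.
            vectors.append([tf[t] * math.log(n / df[t]) for t in vocab])
        return vocab, vectors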

Cornell's SMART System

In SMART, a term that does not appear in a document has a weight of zero in that document's vector. Typically, much pre-processing is done before term vectors are created: common words are removed from the collection and words are replaced by their morphological stems before terms are automatically selected.
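A minimal sketch of such a pipeline follows, with a toy stoplist and a deliberately naive suffix stripper standing in for SMART's actual stopword list and stemmer.

    STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "that"}  # toy stoplist

    def naive_stem(word):
        # Crude stand-in for a real stemmer (e.g. Porter's algorithm).
        for suffix in ("ing", "ed", "es", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word

    def preprocess(text):
        tokens = [w.lower() for w in text.split()]
        tokens = [w for w in tokens if w not in STOP_WORDS]  # remove common words
        return [naive_stem(w) for w in tokens]               # replace words by stems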

Bellcore's LSI

The LSI approach differs significantly from SMART's: it attempts to determine the relationships between terms through more than simple co-occurrence. Furthermore, the only pre-processing LSI requires is the removal of common words from the collection before indexing.

LSI represents a document collection with a factor-based model that captures the most important factors in the collection. The model represents each document as a k-element vector, where k is the number of factors found in the collection, and it can be depicted geometrically as a space with k orthogonal dimensions. Words used in the same documents are assumed to be related to each other by the topic of those documents, so such terms are placed close to each other in the model. Position in the k-space thus serves as a kind of indexing: the closer two items are in the space, the more related they are.
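Such factor models are conventionally computed with a singular value decomposition of the term-document matrix. The following sketch, which uses NumPy's SVD and keeps the k largest factors, illustrates the idea; it is a simplification, not Bellcore's implementation.

    import numpy as np

    def lsi(term_doc_matrix, k):
        # term_doc_matrix: terms x documents array of weights.
        u, s, vt = np.linalg.svd(term_doc_matrix, full_matrices=False)
        # Keep only the k most important factors.
        doc_vectors = (np.diag(s[:k]) @ vt[:k]).T   # one k-element vector per document
        term_vectors = u[:, :k]                     # one k-element vector per term
        return term_vectors, doc_vectors

    def relatedness(a, b):
        # Proximity in the k-space: the higher the cosine, the more related.
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))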

Making Semantic Links

My documents consist of sections, paragraphs, sentences, and word groups within scholarly articles. Word groups are consecutive words in sentences that are potentially more indicative of meaning than their individual component words. I automatically identify these groups using Li's method [Li93].
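Li's selection criteria are not reproduced here; as a rough illustration only, candidate word groups can be enumerated as short runs of consecutive words within sentence boundaries.

    def word_groups(sentence_tokens, max_len=3):
        # Enumerate candidate word groups: runs of 2..max_len consecutive
        # words within a single sentence. (Illustrative only; Li's method
        # [Li93] applies further selection criteria.)
        groups = []
        for size in range(2, max_len + 1):
            for i in range(len(sentence_tokens) - size + 1):
                groups.append(tuple(sentence_tokens[i : i + size]))
        return groups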

Abstract/Conclusion links

Links are made from summary sections, such as the abstract and conclusion, to the parts of the text that they summarize. I assume that single sentences in those sections summarize whole paragraphs and sections in other parts of the document. Although that assumption is not always correct [DRM89], I must make it if there are to be reasonable links from the abstract. I further assume that the first two sentences of any paragraph provide a good indication of the purpose of the paragraph. My view is somewhat supported by evidence from studies of English rhetorical structure [Coe88a] and by linguistic theorists [Kie80]. With these assumptions I believe it is sound to make links from the abstract to paragraphs based on the similarity of vocabulary between the abstract and the first two sentences of each paragraph.
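Under these assumptions the linking step reduces to a similarity comparison between each abstract sentence and the opening sentences of each paragraph. The sketch below illustrates this; the vectorize and similarity functions and the threshold are illustrative placeholders, not tuned values.

    def abstract_links(abstract_sentences, paragraphs, vectorize, similarity,
                       threshold=0.3):
        # Link each abstract sentence to every paragraph whose first two
        # sentences share enough vocabulary with it.
        links = []
        for s in abstract_sentences:
            s_vec = vectorize(s)
            for p_id, sentences in paragraphs.items():
                head = " ".join(sentences[:2])   # first two sentences of paragraph
                if similarity(s_vec, vectorize(head)) >= threshold:
                    links.append((s, p_id))
        return links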

Scattered Discussion Links

I am using methods developed at Cornell to link passages that discuss the same topics but are separated in the text [SA93, All95]. In simplified form, the method compares word groups and sentences to progressively smaller text units (sections, subsections, paragraphs, and sentences) to find similar passages for linking. Links are made only to passages with a vocabulary match in at least one subunit (e.g. a sentence is a subunit of a paragraph), and links are made to the smallest matching unit.
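In outline, the comparison descends through the structural hierarchy, as in the following sketch. It is a simplification of [SA93, All95]: the similarity test is abstracted out, and the unit data structure is assumed.

    def link_target(source, unit, similar):
        # unit has a .subunits list (section -> subsection -> paragraph ->
        # sentence); sentences have an empty .subunits list.
        if not similar(source, unit):
            return None                     # no vocabulary match at this level
        if not unit.subunits:
            return unit                     # a sentence: smallest possible unit
        matches = [s for s in unit.subunits if similar(source, s)]
        if not matches:
            return None                     # no matching subunit: no link here
        # Descend to find the smallest matching unit within this one.
        return link_target(source, matches[0], similar) or matches[0]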

I use other rules to avoid confusing readers with too many links: I allow only one link to a passage from within any paragraph. I do not allow links to adjacent paragraphs unless they are in different sections. I may find that I must limit the number of links per paragraph as well.
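These rules can be applied as a post-filter over candidate links; the sketch below assumes globally numbered paragraphs and is illustrative only.

    def filter_links(candidates):
        # candidates: (src_para, src_section, tgt_para, tgt_section, passage)
        # tuples in order of preference.
        kept, seen = [], set()
        for src_para, src_sec, tgt_para, tgt_sec, passage in candidates:
            if (src_para, passage) in seen:
                continue   # only one link to a given passage from any paragraph
            if abs(src_para - tgt_para) == 1 and src_sec == tgt_sec:
                continue   # no links to adjacent paragraphs in the same section
            seen.add((src_para, passage))
            kept.append((src_para, passage))
        return kept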

Further References

[All95]
James Allan. Automatic Hypertext Construction. PhD thesis, Cornell University, 1995.
[Coe88a]
Richard M. Coe. Toward a Grammar of Passages. Southern Illinois University Press, 1988. pp. 48-51.
[DRM89]
Andrew Dillon, John Richardson, and Cliff McKnight. Human factors of journal usage and design of electronic texts. Interacting with Computers, 1(2):183-189, 1989.
[Kie80]
David E. Kieras. Initial mention as a signal to thematic content in technical passages. Memory & Cognition, 8(4):345-353, 1980.
[Li93]
Zhuoxon Li. Information Retrieval for Automatic Link Creation in Hypertext Systems. PhD thesis, Southampton University, 1993.
[SA93]
Gerard Salton and James Allan. Selective text utilization and text traversal. In Proceedings of the Fifth ACM Conference on Hypertext, Seattle, Washington, USA, 14-18 November 1993. ACM SIGLINK, SIGIR, SIGOIS, ACM Press.