Information retrieval (IR) systems are used to select relevant documents from document collections in response to users' requests. A user's request is called a query in IR. The systems determine which documents are relevant by comparing compact representations of documents and queries. Documents whose representations are most similar to a given query are retrieved. The representations are made to store only the essential information in the documents and queries.
Significant words in a document are called terms. In both IR systems I use, documents are represented by term vectors. A term vector consists of numerical weights for the terms in the document collection: the more significant a term is, the greater its weight. A term's significance is determined by the part it plays in the collection and in individual documents. The term vector for a particular document may include entries for terms that are not present in that document, because those terms occur elsewhere in the document collection.
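The idea of weighting terms by their significance in both a document and the collection can be sketched with the common tf-idf scheme. The weighting formula here is an illustrative assumption, not necessarily the one used by either system described below; note that each vector has an entry (possibly zero-weighted) for every term in the collection's vocabulary:

```python
import math
from collections import Counter

def term_vectors(documents):
    """Build a term vector for each document over the collection's
    whole vocabulary.  A term's weight grows with its frequency in the
    document (local significance) and shrinks with the number of
    documents it appears in (collection-wide significance)."""
    vocabulary = sorted({t for doc in documents for t in doc})
    n_docs = len(documents)
    doc_freq = Counter(t for doc in documents for t in set(doc))
    vectors = []
    for doc in documents:
        tf = Counter(doc)
        vectors.append([tf[t] * math.log(n_docs / doc_freq[t])
                        for t in vocabulary])
    return vocabulary, vectors

docs = [["term", "weighting", "term"], ["vector", "weighting"]]
vocab, vecs = term_vectors(docs)
```

A term like "weighting", which occurs in every document, receives a weight of zero: it cannot help distinguish one document from another.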
SMART is a major IR system with a long history. In SMART, if a term does not appear in a document then it has a weight of zero. Typically, much pre-processing is done before term vectors are created: common words are removed from the collection, words are replaced by their morphological stems, and then terms are automatically selected.
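The pre-processing pipeline can be sketched as follows. The stop-word list and the crude suffix-stripping rule are assumptions for illustration; a real system would use a proper stemmer such as Porter's:

```python
def preprocess(text, stop_words):
    """Illustrative pre-processing: remove common words, then reduce
    the remaining words to rough stems by stripping a few frequent
    suffixes (a stand-in for a real morphological stemmer)."""
    def stem(word):
        for suffix in ("ing", "ed", "es", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[:-len(suffix)]
        return word

    tokens = text.lower().split()
    return [stem(t) for t in tokens if t not in stop_words]

stops = {"the", "of", "a", "is"}
terms = preprocess("The indexing of documents is automated", stops)
```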
The LSI approach differs significantly from SMART's. LSI attempts to determine the relationships between terms through more than simple co-occurrence. Furthermore, the only pre-processing LSI requires is the removal of common words from the collection before indexing.
LSI represents a document collection with a factor-based model that captures the most important factors in the collection. The model represents each document as a k-element vector, where k is the number of factors found in the document collection. The model can be represented geometrically in k-space by k orthogonal dimensions. Words used in the same documents are assumed to be related to each other by the topic of those documents, so such terms are placed close to each other in the model. Position in the k-space thus serves as a kind of indexing: the closer two items are in the space, the more related they are.
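The "closer in the space means more related" idea can be sketched with cosine similarity over k-element factor vectors. The three-factor vectors below are made-up stand-ins for the output of an LSI analysis, not real data:

```python
import math

def cosine(u, v):
    """Similarity of two k-element factor vectors: the closer two
    items lie in the k-dimensional space, the higher this value."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = (math.sqrt(sum(a * a for a in u))
            * math.sqrt(sum(b * b for b in v)))
    return dot / norm if norm else 0.0

# Hypothetical k = 3 factor representations of three documents.
doc_a = [0.9, 0.1, 0.0]   # loads mostly on factor 1
doc_b = [0.8, 0.2, 0.1]   # near doc_a: presumably a related topic
doc_c = [0.0, 0.1, 0.9]   # far from both: a different topic
```

Comparing `doc_a` with the other two ranks `doc_b` as the more related document, since the two vectors point in nearly the same direction in the factor space.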
My documents consist of sections, paragraphs, sentences, and word groups within scholarly articles. Word groups are consecutive words in sentences that are potentially more indicative of meaning than their individual component words. I automatically identify these groups using Li's method [Li93].
Links are made from summary sections, such as the abstract and conclusion, to the parts of the text that they summarize. I assume that single sentences in those sections summarize whole paragraphs and sections in other parts of the document. Although that assumption is not always correct [DRM89], I must make it if there are to be reasonable links from the abstract. I further assume that the first two sentences of any paragraph provide a good indication of the purpose of the paragraph. My view is somewhat supported by evidence from studies of English rhetorical structure [Coe88a] and by linguistic theorists [Kie80]. With these assumptions I believe it sound to make links from abstract to paragraphs based on the similarity of vocabulary between the abstract and the first two sentences of the paragraph.
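The abstract-to-paragraph linking step can be sketched as below. The function names, the Jaccard-style overlap measure, and the threshold value are all assumptions for illustration, not the measure actually used:

```python
def link_abstract(abstract_sentences, paragraphs, threshold=0.2):
    """Link each abstract sentence to the paragraph whose first two
    sentences share the most vocabulary with it, provided the overlap
    clears a minimum threshold.  Paragraphs are lists of sentences."""
    links = []
    for i, sent in enumerate(abstract_sentences):
        sent_words = set(sent.lower().split())
        best, best_score = None, threshold
        for j, para in enumerate(paragraphs):
            lead = " ".join(para[:2])          # first two sentences only
            lead_words = set(lead.lower().split())
            overlap = (len(sent_words & lead_words)
                       / len(sent_words | lead_words))
            if overlap > best_score:
                best, best_score = j, overlap
        if best is not None:
            links.append((i, best))
    return links

abstract = ["we evaluate linking methods"]
paras = [["we evaluate linking methods here", "more detail follows"],
         ["unrelated topic entirely", "nothing shared"]]
links = link_abstract(abstract, paras)
```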
I am using methods developed at Cornell to link passages that discuss the same topics but are separated in the text [SA93,All95]. In simplified form, the method compares word groups and sentences to progressively smaller text units (sections, subsections, paragraphs, and sentences) to find similar passages for linking. Links are made only to passages with a vocabulary match in at least one subunit (e.g., a sentence is a subunit of a paragraph), and they are made to the smallest matching unit.
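The narrowing step can be sketched as a recursive search. The `(text, subunits)` representation of a unit is an assumption made for illustration; the Cornell method itself is more elaborate:

```python
def find_link_target(query_words, unit):
    """Descend through a unit's hierarchy (section -> paragraph ->
    sentence) to find the smallest subunit whose vocabulary matches
    the query.  A unit is a (text, subunits) pair; a unit with no
    matching subunit yields no link at all."""
    text, subunits = unit
    hits = [s for s in subunits
            if query_words & set(s[0].lower().split())]
    if not hits:
        return None                 # no subunit match: make no link
    for sub in hits:
        deeper = find_link_target(query_words, sub)
        if deeper is not None:
            return deeper           # a still-smaller unit matched
    return hits[0][0]               # smallest matching unit found

sentence1 = ("retrieval systems compare vectors", [])
sentence2 = ("unrelated sentence", [])
paragraph = ("retrieval systems compare vectors unrelated sentence",
             [sentence1, sentence2])
section = ("section body", [paragraph])
target = find_link_target({"vectors"}, section)
```

Here the query matches the section, the paragraph, and one sentence, so the link lands on the sentence, the smallest matching unit.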
I use other rules to avoid confusing readers with too many links: I allow only one link to a passage from within any paragraph. I do not allow links to adjacent paragraphs unless they are in different sections. I may find that I must limit the number of links per paragraph as well.
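These link-limiting rules could be applied as a filter over candidate links. The tuple layout for a link (source paragraph, target paragraph, and their enclosing sections) is an assumption for the sketch:

```python
def filter_links(candidates):
    """Apply the link-limiting rules to candidate links, each given as
    (src_para, tgt_para, src_section, tgt_section) with paragraphs
    numbered in document order."""
    seen, kept = set(), []
    for src_para, tgt_para, src_sec, tgt_sec in candidates:
        if (src_para, tgt_para) in seen:
            continue   # only one link to a passage from within a paragraph
        if src_sec == tgt_sec and abs(src_para - tgt_para) == 1:
            continue   # no links to adjacent paragraphs in the same section
        seen.add((src_para, tgt_para))
        kept.append((src_para, tgt_para, src_sec, tgt_sec))
    return kept

candidates = [(1, 5, "A", "B"), (1, 5, "A", "B"),
              (2, 3, "A", "A"), (2, 7, "A", "B")]
kept = filter_links(candidates)
```

The duplicate link and the link between adjacent paragraphs of section "A" are both dropped.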
[Jump back to Linking With Information Retrieval section or to the top of the main document]
Copyright © J. Blustein, 1998.