The project for this course is to build an information retrieval system based on the Vector Space Model. Note that we will use tf.idf weights.
The purpose of this milestone is to build your file structures for the Vector Space information retrieval system, and to introduce you to gdbm. gdbm is a hashed-access file library available on Unix. It hashes on a key, so that you can store and retrieve the data associated with that key (for example, a set of tags or pointers).
1. There are 3 basic aspects to this assignment:
The inverted file built in this milestone will be used for processing queries.
2. From the results of milestone 1, select a set of noise words that are NOT to be included in the inverted file of keywords. Do not waste time going through your entire Zipf file looking for words to delete; concentrate on the high-frequency noise words. You might also be interested in the set of stop words developed by the Smithsonian, listed at the end of this milestone in descending order of frequency, or in a larger set of stop words and stemming algorithms.
You should also consider dropping words that occur with very low frequency in the database, and perhaps using Porter's stemming algorithm.
3. There are a number of different ways of designing your files using the gdbm utility. I suggest the following, but you are free to design your own.
The file will be the inverted file of keywords. For each keyword, you will need the start and end byte offsets of each news item in which the keyword appears. These offsets are used both for the similarity operations and for displaying the news item in the interface. It is also possible to keep only the start byte offset.
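To make the suggested layout concrete, here is a minimal sketch in C of the data field for one keyword: a contiguous array of (start, end) byte offset pairs. The names Posting, PostingList, posting_list_add and posting_list_bytes are hypothetical and chosen only for this illustration, and using the keyword itself as the key is an assumption. The sketch does not call gdbm; it only prepares the bytes you would hand to it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* One posting: the byte range of a news item that contains the keyword. */
typedef struct {
    long start;  /* byte offset where the news item begins */
    long end;    /* byte offset where the news item ends */
} Posting;

/* A growable list of postings for a single keyword. */
typedef struct {
    Posting *items;     /* contiguous array, usable as a data field */
    size_t   count;     /* number of postings stored */
    size_t   capacity;  /* number of postings allocated */
} PostingList;

/* Append one posting. Returns 0 on success, -1 if memory runs out. */
int posting_list_add(PostingList *list, long start, long end)
{
    assert(list != NULL && start <= end);
    if (list->count == list->capacity) {
        size_t new_cap = list->capacity ? list->capacity * 2 : 4;
        Posting *grown = realloc(list->items, new_cap * sizeof *grown);
        if (grown == NULL)
            return -1;
        list->items = grown;
        list->capacity = new_cap;
    }
    list->items[list->count].start = start;
    list->items[list->count].end = end;
    list->count++;
    return 0;
}

/* Size in bytes of the data field for this keyword. With gdbm, the
   content datum would use dptr = (char *)list->items and
   dsize = (int)posting_list_bytes(list). */
size_t posting_list_bytes(const PostingList *list)
{
    assert(list != NULL);
    return list->count * sizeof(Posting);
}
```

Start with `PostingList list = {NULL, 0, 0};`, call posting_list_add once per news item containing the keyword, and free list.items when done. Writing raw structs like this is simple, but the resulting file is only readable on a machine with the same size and byte order for long. If you keep only the start byte offset, the same idea applies with a single long per posting.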
4. You will also have to build the file representing the document vectors, where each vector contains only those terms with non-zero term weights. Each term must have its tf.idf weight. This can also be a gdbm file with the start byte offset being the key field and the terms and weights in the data field.
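This milestone does not spell out the weighting formula, so as a point of reference here is one common textbook form of tf.idf with a small worked example. Treat it as an assumption and use the definition given in lecture if it differs. The symbols and the base-10 logarithm are choices made for this sketch only.

```latex
% w_{t,d}  : weight of term t in news item d
% tf_{t,d} : number of times t occurs in d
% df_t     : number of news items that contain t
% N        : total number of news items in the database
w_{t,d} = \mathrm{tf}_{t,d} \cdot \log_{10}\frac{N}{\mathrm{df}_t}

% Worked example: N = 1000, df_t = 10, tf_{t,d} = 3
w_{t,d} = 3 \cdot \log_{10}\frac{1000}{10} = 3 \cdot 2 = 6
```

Under this form, a term that appears in every news item gets weight 3 * log(1) = 0 whatever its tf, so it drops out of the document vector, which stores only non-zero weights.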
1. Please see the gdbm man pages on torch. There are also a number of tutorials available on the web, as well as example C programs that show how to create a gdbm file and how to read from it.
The following list of 27 noise words was developed by the Smithsonian.
They account for approximately 33.3% of word usage in English-language
abstracts. In descending order of frequency of occurrence:
the, of, and, to, in, a, be, will, for, on, is, with, by, as, this, are, from, that, or, at, been, an, was, were, have, has, it
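As a starting point for filtering, here is a minimal sketch in C that checks a word against the 27 noise words listed above. The names NOISE_WORDS and is_noise_word are hypothetical, and the case-insensitive comparison is an assumption about how you tokenize. Extend the table with your own high-frequency words from milestone 1.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* The 27 Smithsonian noise words, in descending order of frequency. */
static const char *const NOISE_WORDS[] = {
    "the", "of", "and", "to", "in", "a", "be", "will", "for", "on",
    "is", "with", "by", "as", "this", "are", "from", "that", "or", "at",
    "been", "an", "was", "were", "have", "has", "it"
};
static const size_t NOISE_WORD_COUNT =
    sizeof NOISE_WORDS / sizeof NOISE_WORDS[0];

/* Return 1 if `word` is a noise word (ignoring case), 0 otherwise.
   A linear scan is adequate for a list this short, and the most
   frequent words are tested first. */
int is_noise_word(const char *word)
{
    assert(word != NULL);
    for (size_t i = 0; i < NOISE_WORD_COUNT; i++) {
        const char *a = word;
        const char *b = NOISE_WORDS[i];
        /* Walk both strings while the lowercased characters agree. */
        while (*a != '\0' && *b != '\0' &&
               tolower((unsigned char)*a) == *b) {
            a++;
            b++;
        }
        if (*a == '\0' && *b == '\0')
            return 1;  /* both ended together: exact match */
    }
    return 0;
}
```

Call is_noise_word on each token before adding it to the inverted file and skip the token when it returns 1. If your list grows to several hundred words, a sorted table with bsearch or a hash lookup would probably be a better fit than the linear scan.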