CSCI 6702 Parallel Computing Project Web Page


Project: Parallel Document Clustering

Group: Deyun Gao, Xiaohu Li, Zheyuan Yu

Emails: dgao@cs.dal.ca; xiaohu@cs.dal.ca; zyu@cs.dal.ca

Topic Description:

Document vectors are very high dimensional, because even a small document collection may contain thousands of unique terms. High dimensionality poses a challenge to document clustering algorithms. K. Beyer et al. [Beyer1999] have shown that in high dimensional space, the distance between each pair of points is almost the same for a wide variety of data distributions and distance functions. In such situations, the similarity measures used by clustering algorithms do not work well, so the meaningfulness of the resulting clusters may be doubtful. This problem is traditionally referred to as the curse of dimensionality [Bellman1961].
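The distance concentration effect described above can be illustrated with a small experiment. The following is only a minimal sketch, not the method of [Beyer1999] or of this project: it assumes points drawn uniformly from the unit cube and Euclidean distance, and the function name relative_contrast and its parameters are chosen here for illustration. It measures how much farther the farthest point is than the nearest one, relative to the nearest distance.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (d_max - d_min) / d_min for the Euclidean distances from one
    random query point to n_points random points in the unit cube [0, 1]^dim.

    Assumption for illustration: uniform data and Euclidean distance.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    d_min, d_max = min(dists), max(dists)
    return (d_max - d_min) / d_min


if __name__ == "__main__":
    # The contrast should shrink as the number of dimensions grows.
    for dim in (2, 10, 100, 1000):
        print(dim, round(relative_contrast(dim), 3))
```

Under these assumptions, the printed contrast drops sharply as the dimension increases, which is the sense in which nearest and farthest neighbors become hard to tell apart.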

In addition, large collections of documents are becoming increasingly common. The public internet currently has more than 3 billion web pages, and private intranets also contain an abundance of text data. Clustering such huge document collections efficiently is a great challenge, and the use of parallel computing techniques in large scale document clustering is unavoidable.

In this project, our main concern is to obtain an effective document clustering algorithm and to implement it in parallel, giving a highly efficient process for clustering very large document collections.

Literature Survey:

PS, PDF, LaTeX

Slides: prsentation.pdf

Final Report: