CSCI 6702 Parallel Computing Project Web Page
Project: Parallel Document Clustering
Group: Deyun Gao, Xiaohu Li, Zheyuan Yu
Emails: dgao@cs.dal.ca; xiaohu@cs.dal.ca; zyu@cs.dal.ca
Topic Description:
Document vectors are very high dimensional because even a
small document collection may have thousands of unique terms. High
dimensionality poses a challenge to document clustering
algorithms. K. Beyer et al. [Beyer1999] have shown that in high
dimensional space, the distances between pairs of points become
almost the same for a wide variety of data distributions and
distance functions. In such situations, the similarity measures
used by clustering algorithms do not work effectively, hence the
meaningfulness of clustering may be doubtful. This problem is
traditionally referred to as the curse of dimensionality
[Bellman1961].
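The distance concentration effect described above can be observed
with a small experiment. The following is only an illustrative
sketch, not part of the project code: it assumes points drawn
uniformly at random from the unit hypercube and Euclidean distance,
and the function name relative_contrast and its parameters are
hypothetical. It reports how much farther the farthest point is
from a query point than the nearest one, relative to the nearest
distance.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (d_max - d_min) / d_min for the Euclidean distances from
    one random query point to n_points random points.

    Assumption: all points are uniform in the unit hypercube [0, 1]^dim.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


if __name__ == "__main__":
    # The contrast is expected to shrink as the dimension grows.
    for dim in (2, 10, 100, 1000):
        print(dim, round(relative_contrast(dim), 3))
```

A contrast that shrinks toward zero as the dimension grows means
the nearest and farthest points are almost equally far away, which
is the situation in which distance-based similarity measures lose
their discriminating power.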
Large collections of documents are also becoming increasingly
common. The public internet currently has more than 3 billion web
pages, while private intranets also contain an abundance of text
data. It is a great challenge to cluster such huge document
collections efficiently. The use of parallel computing techniques
in large scale document clustering is unavoidable.
In this project, our main concern is to obtain an effective
document clustering algorithm and to implement it in parallel,
yielding a highly efficient process for clustering very large
document collections.
Literature Survey:
PS, PDF, LaTeX
Slides: prsentation.pdf
Final Report: