Roger Zhang

Mixed Length Character N-grams in Text Clustering

Character N-grams have proved to be a very effective feature representation for web mining problems using the vector space model [1]. However, because traditional solutions in the literature use only fixed length N-grams (the sliding window for N-gram extraction has a fixed width, typically between 3 and 6 inclusive), the potential of variable or mixed length N-grams has not been fully explored. In this research, we construct profiles with mixed length N-grams using expansion methods based on entropy and probability theory, and evaluate the effect of the resulting representations in the context of automatic text document clustering and feature-document co-clustering. More specifically, we start with a low-level fixed length N-gram (e.g. trigram) profile and use an information gain measure to guide the expansion of each member N-gram. More than one expansion measure can be used for comparison. The resulting collection of mixed length N-grams is then used as the feature set for TF-IDF weight calculations. For the evaluation, we apply well-established algorithms such as standard K-means and Information-theoretic Co-clustering [2] to datasets frequently used in the literature, such as Classic 3 and Reuters [3]. The quality of the clustering solutions can then be compared with that obtained from fixed length N-gram representations as well as term representations.
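To make the expansion step concrete, the sketch below grows a trigram profile into a mixed length one. It is a minimal illustration, not the method evaluated in this research: the follower-entropy criterion, the threshold value, and all function names (char_ngrams, follower_entropy, mixed_length_profile) are illustrative choices, and the information gain measure described above could be substituted for the entropy test.

from collections import Counter
from math import log2


def char_ngrams(text, n):
    """All character N-grams from a sliding window of width n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def follower_entropy(corpus, gram):
    """Shannon entropy of the character following `gram` in the corpus.

    Low entropy means the continuation is nearly deterministic, so the
    (n+1)-grams cover almost the same occurrences as the N-gram itself.
    """
    followers = Counter()
    for doc in corpus:
        i = doc.find(gram)
        while i != -1:
            if i + len(gram) < len(doc):
                followers[doc[i + len(gram)]] += 1
            i = doc.find(gram, i + 1)
    total = sum(followers.values())
    if total == 0:
        return 0.0
    return -sum(c / total * log2(c / total) for c in followers.values())


def mixed_length_profile(corpus, base_n=3, max_n=6, threshold=0.5):
    """Grow a fixed length base profile into a mixed length one.

    A gram is replaced by its observed one-character extensions while its
    follower entropy stays below `threshold` (i.e. the expansion costs
    little information); otherwise it is kept at its current length.
    The threshold is an arbitrary illustrative value.
    """
    frontier = {g for doc in corpus for g in char_ngrams(doc, base_n)}
    profile = set()
    while frontier:
        gram = frontier.pop()
        if len(gram) < max_n and follower_entropy(corpus, gram) < threshold:
            extensions = {g for doc in corpus
                          for g in char_ngrams(doc, len(gram) + 1)
                          if g.startswith(gram)}
            if extensions:
                frontier |= extensions
                continue
        profile.add(gram)  # expansion rejected, or gram only ends documents
    return profile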
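The evaluation pipeline described above could then be assembled as follows. This reuses char_ngrams from the previous sketch; the scikit-learn classes (TfidfVectorizer with a custom analyzer, KMeans) are assumed stand-ins for the actual experimental setup, which additionally includes Information-theoretic Co-clustering [2].

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans


def profile_analyzer(profile):
    """Restrict TF-IDF features to the N-grams in the mixed length profile."""
    lengths = sorted({len(g) for g in profile})

    def analyze(doc):
        return [g for n in lengths for g in char_ngrams(doc, n) if g in profile]

    return analyze


def cluster_documents(corpus, k, profile):
    """TF-IDF weights over the profile features, then standard K-means."""
    vectorizer = TfidfVectorizer(analyzer=profile_analyzer(profile))
    tfidf = vectorizer.fit_transform(corpus)
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(tfidf)

For example, cluster_documents(docs, k=3, profile=mixed_length_profile(docs)) assigns each document to one of three clusters, and the resulting labels can be scored against the known Classic 3 classes.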

References:

1. Y. Miao, V. Keselj, and E. Milios, "Document Clustering using Character N-grams: A Comparative Evaluation with Term-based and Word-based Clustering", Proceedings of the Fourteenth ACM Conference on Information and Knowledge Management (CIKM 2005), Bremen, Germany, 2005.

2. I.S. Dhillon, S. Mallela, and D.S. Modha, "Information-theoretic Co-clustering", Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003.

3. D. Lewis, Reuters-21578 Text Categorization Test Collection, 1997.