M. Mahdi Shafiei

Model based co-clustering in high dimensional spaces

My research interests lie in machine learning, with a focus on natural language text processing. I'm particularly interested in applying machine learning techniques (graphical models, kernel methods) to text representation, classification, and clustering. My most recent work examines the performance of different dimension reduction techniques in clustering text data. I'm currently studying the application of statistical models to co-clustering.

Co-clustering, the simultaneous clustering of the rows and columns of a two-dimensional data matrix, is a data mining technique with applications such as text clustering and microarray analysis. Most proposed co-clustering algorithms place elaborate assumptions on the data matrix and also assume that the data contain a number of mutually exclusive row and column clusters, but such an ideal structure is believed to rarely exist in real data.

We are looking for a co-clustering algorithm with the following three properties. First, it should be applicable to any two-dimensional matrix. Second, it should allow overlapping row and column clusters to be identified. Finally, it should automatically find the optimal number of row and column clusters. As a first step, we have proposed an overlapping co-clustering model that works with any regular exponential family distribution and its corresponding Bregman divergence, which makes the model applicable to a wide variety of clustering distance functions. Using a generative model, the proposed algorithm is able to discover overlapping co-clusters in the input data matrix.
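
To make the link between exponential families and distance functions concrete, the following is a minimal sketch of the general Bregman divergence, d_phi(x, y) = phi(x) - phi(y) - <x - y, grad phi(y)>, for a convex function phi. It is only an illustration of the definition, not the proposed co-clustering model; all function names are hypothetical. The two choices of phi shown are the standard ones that yield the squared Euclidean distance (associated with the spherical Gaussian) and the generalized KL divergence, or I-divergence (associated with the Poisson distribution).

```python
import numpy as np


def bregman_divergence(phi, grad_phi, x, y):
    """Bregman divergence d_phi(x, y) = phi(x) - phi(y) - <x - y, grad phi(y)>.

    phi and grad_phi are a convex function and its gradient (hypothetical
    helper signature, chosen for this illustration only).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(phi(x) - phi(y) - np.dot(x - y, grad_phi(y)))


# phi(x) = ||x||^2 gives the squared Euclidean distance ||x - y||^2.
def sq_norm(x):
    return float(np.dot(x, x))


def sq_norm_grad(x):
    return 2.0 * x


# phi(x) = sum(x log x - x) gives the I-divergence
# sum(x log(x / y) - x + y); inputs must be strictly positive.
def neg_entropy(x):
    return float(np.sum(x * np.log(x) - x))


def neg_entropy_grad(x):
    return np.log(x)


def squared_euclidean(x, y):
    return bregman_divergence(sq_norm, sq_norm_grad, x, y)


def i_divergence(x, y):
    return bregman_divergence(neg_entropy, neg_entropy_grad, x, y)
```

Swapping the pair (phi, grad_phi) changes the distance function without changing the surrounding code, which is the sense in which a model defined over Bregman divergences covers many clustering distances at once.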