I have been working on the following three projects:
Automatic key phrase extraction is a useful tool in many text-related applications such as clustering and summarization. State-of-the-art methods are designed to extract key phrases from traditional text such as technical papers. Applying these methods to Web documents, which often contain diverse and heterogeneous content, is both of particular interest and a challenge in the information age. In this work, we investigate the significance of narrative text classification in the task of automatic key phrase extraction in Web document corpora. We benchmark three methods, TFIDF, KEA, and Keyterm, used to extract key phrases from all the plain text and from only the narrative text of Web pages. ANOVA tests are used to analyze the ranking data collected in a user study using quantitative measures of acceptable percentage and quality value. The evaluation shows that key phrases extracted from the narrative text only are significantly better than those obtained from all plain text of Web pages. This demonstrates that narrative text classification is indispensable for effective key phrase extraction in Web document corpora. A paper based on this work was presented at the Seventh ACM International Workshop on Web Information and Data Management (WIDM'05), Bremen, Germany, November 5, 2005.
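To make the TFIDF baseline concrete, here is a minimal sketch of TF-IDF term ranking in Python. It is an illustration under stated assumptions, not the configuration used in the study: it ranks single words rather than multi-word phrases, uses a plain log(N/df) weighting, and the function names `tokenize` and `tfidf_keywords` are hypothetical. KEA and Keyterm are not shown.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_keywords(docs, doc_index, top_k=5):
    """Rank the words of docs[doc_index] by TF-IDF against the whole corpus.

    Assumption: single words stand in for key phrases, and IDF is the
    unsmoothed log(N / df). The study's exact weighting is not given here.
    """
    tokenized = [tokenize(d) for d in docs]

    # Document frequency: number of documents that contain each word.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    n_docs = len(docs)
    tf = Counter(tokenized[doc_index])
    total = sum(tf.values())

    scores = {
        word: (count / total) * math.log(n_docs / df[word])
        for word, count in tf.items()
    }
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]


if __name__ == "__main__":
    pages = [
        "the cat sat on the mat",
        "the dog sat on the log",
        "the cat chased the dog",
    ]
    print(tfidf_keywords(pages, 0, top_k=3))
```

In the setting described above, the same ranking could be run twice per page, once on all plain text and once on the narrative text only, so that the two key phrase lists can be compared.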
Automatic Web Site Summarization, which generates a concise and informative summary of a given Web site, is becoming more important for effective information management on the rapidly growing World Wide Web. Extraction-based approaches to Web site summarization rely on the extraction of the most significant sentences from the target Web site based on the density of a list of key phrases which best describe the entire Web site. In this work, we benchmark five methods, TFIDF, KEA, Keyword, Keyterm, and Mixture, for key phrase extraction in the automatic Web site summarization task. We investigate the performance of these methods via a formal user study and demonstrate that Keyterm is the best method for extracting key phrases while Mixture is the best one for obtaining key sentences. A paper based on this work is in preparation for submission to a journal.
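The sentence selection step described above can be sketched as follows. This is a simplified illustration, assuming that "density" means the fraction of a sentence's words covered by key phrase occurrences; the actual scoring used in the work may differ. The names `split_sentences`, `key_phrase_density`, and `top_sentences` are hypothetical, and the sentence splitter is deliberately naive.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def words_of(text):
    """Lowercase alphanumeric word tokens of a string."""
    return re.findall(r"[a-z0-9]+", text.lower())


def count_occurrences(words, phrase_words):
    """Count positions where phrase_words appears contiguously in words."""
    n = len(phrase_words)
    return sum(
        1 for i in range(len(words) - n + 1) if words[i:i + n] == phrase_words
    )


def key_phrase_density(sentence, key_phrases):
    """Fraction of the sentence's words covered by key phrase occurrences.

    Assumption: overlapping key phrases are each counted, so the value
    can exceed 1.0 when the key phrase list contains nested phrases.
    """
    words = words_of(sentence)
    if not words:
        return 0.0
    covered = 0
    for phrase in key_phrases:
        phrase_words = words_of(phrase)
        if phrase_words:
            covered += count_occurrences(words, phrase_words) * len(phrase_words)
    return covered / len(words)


def top_sentences(text, key_phrases, k=2):
    """Return the k sentences with the highest key phrase density."""
    sentences = split_sentences(text)
    ranked = sorted(
        sentences,
        key=lambda s: key_phrase_density(s, key_phrases),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    site_text = (
        "Our lab studies web site summarization. Lunch is at noon. "
        "Key phrase extraction supports summarization of large sites."
    )
    print(top_sentences(site_text, ["web site", "key phrase extraction"], k=1))
```

Under this scoring, the quality of the extracted summary depends directly on the key phrase list, which is why the choice among TFIDF, KEA, Keyword, Keyterm, and Mixture matters for the summarization task.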
For my Ph.D. research, I proposed an approach for effective summarization of Web sites with diverse topics and heterogeneous content. The main objective is to develop a system that can take advantage of both the narrative content and the link information embedded in a given Web site to create a hierarchical summary. The system is novel in that it aims to perform automatic summarization via construction of a topic hierarchy, which involves the design and application of techniques such as keyword extraction, text classification, text clustering, and hyperlink analysis. These have been topics of renewed interest in the IR community. Furthermore, the proposed approach has the potential to become an effective means of visualizing large Web sites and to lead to enhanced IR systems for Web site search, where, for example, summaries of Web sites are indexed and presented to the user as the text snippets associated with the query results. A paper based on the thesis proposal, which was defended in late 2004, was presented at the SIGIR'05 Doctoral Consortium.