Lithel

Yongzheng Zhang

Automatic Web Site Summarization

I have been working on the following three projects:

Narrative Text Classification for Automatic Term Extraction in Web Document Corpora

Automatic key phrase extraction is a useful tool in many text-related applications such as clustering and summarization. State-of-the-art methods target traditional text such as technical papers. Applying these methods to Web documents, which often contain diverse and heterogeneous content, is both of particular interest and a challenge in the information age. In this work, we investigate the significance of narrative text classification in the task of automatic key phrase extraction in Web document corpora. We benchmark three methods, TFIDF, KEA, and Keyterm, each used to extract key phrases from all the plain text and from only the narrative text of Web pages. ANOVA tests are used to analyze the ranking data collected in a user study, using the quantitative measures of acceptable percentage and quality value. The evaluation shows that key phrases extracted from the narrative text only are significantly better than those obtained from all the plain text of Web pages. This demonstrates that narrative text classification is indispensable for effective key phrase extraction in Web document corpora. A paper based on this work was presented at the Seventh ACM International Workshop on Web Information and Data Management (WIDM'05), Bremen, Germany, November 5, 2005.
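To make the TFIDF baseline concrete, here is a minimal sketch of ranking single terms in one document against a small corpus. It is an illustration only, not the setup used in the study: the function names (tokenize, tfidf_key_terms), the tokenizer, and the exact weighting (relative term frequency times the natural log of inverse document frequency) are assumptions, and the study's methods extract phrases rather than single words.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_key_terms(documents, doc_index, top_n=3):
    """Return the top_n terms of documents[doc_index] ranked by TF-IDF.

    Assumed weighting: (term count / document length) * ln(N / document frequency).
    Ties are broken alphabetically so the ranking is deterministic.
    """
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)

    # Document frequency: number of documents in which each term occurs.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    tf = Counter(tokenized[doc_index])
    total = sum(tf.values())
    scores = {
        term: (count / total) * math.log(n_docs / df[term])
        for term, count in tf.items()
    }
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]


if __name__ == "__main__":
    pages = [
        "web site summarization",
        "web page classification",
        "web link analysis analysis",
    ]
    print(tfidf_key_terms(pages, doc_index=2, top_n=2))
```

A term that occurs in every document (here "web") receives a score of zero, which is the usual motivation for the inverse document frequency factor. Restricting the input to narrative text, as studied above, would amount to filtering each page's text before it reaches such a scorer.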

A Comparative Study on Key Phrase Extraction Methods in Automatic Web Site Summarization

Automatic Web Site Summarization, which generates a concise and informative summary of a given Web site, is becoming more important for effective information management on the rapidly growing World Wide Web. Extraction-based approaches to Web site summarization rely on the extraction of the most significant sentences from the target Web site based on the density of a list of key phrases which best describe the entire Web site. In this work, we benchmark five methods, TFIDF, KEA, Keyword, Keyterm, and Mixture, for key phrase extraction in the automatic Web site summarization task. We investigate the performance of these methods via a formal user study and demonstrate that Keyterm is the best method for extracting key phrases while Mixture is the best one for obtaining key sentences. A paper based on this work is in preparation for submission to a journal.
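The idea of selecting sentences by key phrase density can be sketched as follows. This is a simplified illustration under stated assumptions, not the scoring used in the study: density is taken here to be the number of words covered by key phrase occurrences divided by the sentence length, and the names sentence_density and top_sentences are hypothetical.

```python
import re


def _words(text):
    """Lowercase the text and split it into alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def sentence_density(sentence, key_phrases):
    """Words covered by key phrase occurrences, divided by sentence length.

    Overlapping phrases are each counted, so the value is a heuristic score
    rather than a strict proportion.
    """
    words = _words(sentence)
    if not words:
        return 0.0
    covered = 0
    for phrase in key_phrases:
        phrase_words = _words(phrase)
        n = len(phrase_words)
        if n == 0:
            continue
        # Slide a window of the phrase's length over the sentence.
        for i in range(len(words) - n + 1):
            if words[i:i + n] == phrase_words:
                covered += n
    return covered / len(words)


def top_sentences(sentences, key_phrases, k=2):
    """Return the k sentences with the highest density (ties keep input order)."""
    ranked = sorted(
        sentences,
        key=lambda s: sentence_density(s, key_phrases),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    sentences = [
        "Web site summarization uses key phrases.",
        "The weather is nice today.",
        "Key phrases describe the Web site.",
    ]
    print(top_sentences(sentences, ["web site", "key phrases"], k=2))
```

In this sketch the quality of the selected sentences depends entirely on the key phrase list passed in, which is why the choice of key phrase extraction method matters for the summary.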

Automatic Concept Hierarchy Construction for Effective Management of Web Site Contents (Ph.D. thesis)

For my Ph.D. research, I proposed an approach for effective summarization of Web sites with diverse topics and heterogeneous content. The main objective is to develop a system that can take advantage of both the narrative content and the link information embedded in a given Web site to create a hierarchical summary. The system is novel in that it aims to perform automatic summarization via construction of a topic hierarchy, which involves the design and application of techniques such as keyword extraction, text classification, text clustering, and hyperlink analysis. These have been topics of renewed interest in the IR community. Furthermore, the proposed approach has the potential to become an effective means of visualizing large Web sites and to lead to enhanced IR systems for searching Web sites, where, for example, summaries of Web sites are indexed and presented to the user as the text snippets associated with the query results. A paper based on the thesis proposal, which was defended in late 2004, was presented at the SIGIR'05 Doctoral Consortium.

Selected Reading List

S. Chakrabarti et al. Mining the Web's Link Structure. IEEE Computer, 32(8):60-67, 1999.
G. Flake, S. Lawrence, and L. Giles. Efficient Identification of Web Communities. In Proceedings of ACM SIGKDD'00, pages 150-160, 2000.
E. Glover et al. Using Web Structure for Classifying and Describing Web Pages. In Proceedings of WWW'02, pages 562-569, 2002.
D. Lawrie, B. Croft, and A. Rosenberg. Finding Topic Words for Hierarchical Summarization. In Proceedings of ACM SIGIR'01, pages 349-357, 2001.
D. Lawrie and W. Croft. Generating Hierarchical Summaries for Web Searches. In Proceedings of ACM SIGIR'03, pages 457-458, 2003.
W. Li et al. Constructing Multi-Granular and Topic-Focused Web Site Maps. In Proceedings of WWW'01, pages 343-354, 2001.
M. Sanderson and B. Croft. Deriving Concept Hierarchies from Text. In Proceedings of ACM SIGIR'99, pages 206-213, 1999.
P. Turney. Learning Algorithms for Keyphrase Extraction. Information Retrieval, 2(4):303-336, 2000.
P. Turney. Coherent Keyphrase Extraction via Web Mining. In Proceedings of IJCAI'03, pages 434-439, 2003.
Y. Wang and M. Kitsuregawa. Use Link-based Clustering to Improve Web Search Results. In Proceedings of WISE'01, pages 115-124, 2001.
Y. Wang and M. Kitsuregawa. Evaluating Contents-link Coupled Web Page Clustering for Web Search Results. In Proceedings of ACM CIKM'02, pages 499-506, 2002.
I. Witten, G. Paynter, E. Frank, C. Gutwin, and C. Nevill-Manning. KEA: Practical Automatic Keyphrase Extraction. In Proceedings of ACM DL'99, pages 254-255, 1999.