Hongyu Liu

Focused Crawling on the Web using Probabilistic Models

My research interests are in the areas of machine learning, information retrieval, Web mining, and their applications. I am currently working on Web crawling and searching using probabilistic approaches and graphical models.

A focused crawler is an efficient tool for traversing the Web to gather documents on a specific topic. Focused crawlers must use information gleaned from previously crawled page sequences to estimate the relevance of a newly seen URL, so good performance depends on powerful modelling of context as well as of the current observations. Probabilistic models, such as Hidden Markov Models (HMMs) and Conditional Random Fields (CRFs), can potentially capture both formatting and context.
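To make the crawling process concrete, the sketch below shows a generic best-first crawl loop in Python in which each newly seen URL is scored by an estimator applied to the sequence of pages leading to it. This is only an illustration of where such an estimator plugs in; the helper functions (estimate_relevance, fetch, extract_links) are hypothetical placeholders, not part of our system.

    import heapq
    import itertools

    def crawl(seed_urls, estimate_relevance, fetch, extract_links, max_pages=1000):
        counter = itertools.count()            # tie-breaker so the heap never compares pages
        frontier = [(-1.0, next(counter), url, []) for url in seed_urls]
        heapq.heapify(frontier)
        visited = set()
        while frontier and len(visited) < max_pages:
            neg_score, _, url, path = heapq.heappop(frontier)
            if url in visited:
                continue
            visited.add(url)
            page = fetch(url)                  # download the page
            new_path = path + [page]           # the sequence of pages leading here
            for link in extract_links(page):
                if link not in visited:
                    # Relevance of the candidate is estimated from the whole path,
                    # not just the current page -- the role played by a sequential model.
                    score = estimate_relevance(new_path, link)
                    heapq.heappush(frontier, (-score, next(counter), link, new_path))
        return visited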

We propose a new approach to focused crawling that captures sequential patterns leading to targets using probabilistic models. We model a focused crawler as a random surfer over an underlying Markov chain of hidden states, defined by the number of hops away from targets, from which the actual topics of the documents are observed. When a new document is seen, the task is to estimate how many hops away from a target it lies. Our approach combines content analysis with the link structure of paths leading to targets, learning these paths from training data and emulating them to find more relevant pages. With HMMs, we focused on semantic content analysis, learning sequential patterns from users' browsing behavior on specific topics. We are currently extending this work with CRFs to combine multiple overlapping features extracted from Web pages; the flexibility of CRFs fits our approach well, allowing useful context to be represented, including not only text content but also linkage relations.
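As a rough illustration of the hidden-state view described above, the Python sketch below treats the hidden state of a page as its number of hops to a target and the observation as a discrete topic label, then runs the standard forward algorithm to obtain a distribution over the current distance given the topics observed along a crawl path. The transition and emission parameters here are illustrative placeholders; in our approach they would be learned from training paths rather than fixed by hand.

    import numpy as np

    MAX_HOPS = 3                       # hidden states: 0, 1, 2, 3 hops from a target
    N_TOPICS = 5                       # discrete topic labels observed on pages

    # P(next state | state): following a link tends to move one hop closer to a
    # target, but may also keep the distance the same or increase it.
    trans = np.full((MAX_HOPS + 1, MAX_HOPS + 1), 0.05)
    for s in range(1, MAX_HOPS + 1):
        trans[s, s - 1] = 0.6          # move one hop closer
        trans[s, s] = 0.25             # stay at the same distance
    trans[0, 0] = 0.7                  # targets tend to link to other targets
    trans /= trans.sum(axis=1, keepdims=True)

    # P(topic | state): placeholder emission distributions; pages close to targets
    # would emit on-topic labels more often in a learned model.
    emit = np.random.default_rng(0).dirichlet(np.ones(N_TOPICS), size=MAX_HOPS + 1)

    start = np.full(MAX_HOPS + 1, 1.0 / (MAX_HOPS + 1))

    def distance_posterior(topic_sequence):
        """Forward algorithm: P(current hops-to-target | topics observed on the path)."""
        alpha = start * emit[:, topic_sequence[0]]
        for t in topic_sequence[1:]:
            alpha = (alpha @ trans) * emit[:, t]
        return alpha / alpha.sum()

    # Example: topic labels observed along a crawl path leading to a candidate page.
    print(distance_posterior([4, 2, 1, 0]))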

We are also exploring potential applications of our learning model, such as personalized search tools, Web portals, and recommendation systems.