Geographic Entity Extraction from the Web

Thesis: High Accuracy Postal Address Extraction from Web Pages

The address dataset can be downloaded from here.

Source code of the Data Collector

Address Extraction Demo: Regular Expression Based System, Machine Learning Based System

Presentation at MALNIS, Jun 27, 2003

  • Presentation PPT version, PDF version
  • Presentation on Information Theoretical Co-Clustering - May 20, 2004, LaTeX

    Geographic Search

  • Applications
  • Datasets
  • Papers

    Information Extraction




  • Dalhousie NLP Group
  • NLP Course, NLP Course and NLP Group at Stanford, NLP Code and Data resources, including some examples in Java.
  • Information Retrieval Resources Link
  • CRL Computing Research Laboratory UIUC course 498

    Link Analysis

  • When experts agree: using non-affiliated experts to rank popular topics (2002)
  • Hilltop: A Search Engine based on Expert Documents (2000)
  • Who Links to Whom: Mining Linkage between Web Sites (2001) by Krishna Bharat
  • Improved Algorithms for Topic Distillation in a Hyperlinked Environment (1998)
  • The connectivity server: Fast access to linkage information on the Web (1998)
  • Automatic resource compilation by analyzing hyperlink structure and associated text (1998)
  • The Quest for Correct Information on the Web: Hyper Search Engines
  • Graph structure in the web
  • Finding Related Pages in the World Wide Web (1999)
  • IBM Almaden WebFountain

    Parallel Clustering

  • A Hybrid Parallel Web Document Clustering Algorithm and Its Performance Study (2003)
  • Parallelizing the Buckshot Algorithm for Efficient Document Clustering (2002)
  • Clustering and Classification of Large Document Bases in a Parallel Environment (1997)


  • Efficient Clustering of Very Large Document Collections (2001)
  • Principal Direction Divisive Partitioning (1997) - PDDP project
  • An Analysis of Recent Work on Clustering Algorithms (1999)
  • Survey Of Clustering Data Mining Techniques (2002)
  • Overcoming the Curse of Dimensionality in Clustering by Means of the Wavelet Transform (2000)
  • Oren Zamir. Web Document Clustering: A Feasibility Demonstration (1998)
  • Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections (1992)
  • Feature Selection and Document Clustering
  • Charu C. Aggarwal, Philip S. Yu, Finding Generalized Projected Clusters in High Dimensional Spaces (2000)
  • The Challenges of Clustering High Dimensional Data
  • A matrix density based algorithm to hierarchically co-cluster documents and words (2003)
  • Assessment and Pruning of Hierarchical Model Based Clustering (2003)
  • Why so many clustering algorithms
  • Projections for Efficient Document Clustering (1997)
  • Document Clustering with Cluster Refinement and Model Selection Capabilities (2002)
  • Shenghuo Zhu's Publications

    Information Retrieval

  • iVia tools
  • Great Site Ranking in Google The Secrets Out
  • A Survey On Web Information Retrieval Technologies (2000)
  • Algorithmic Challenges in Web Search Engines, by Monika R. Henzinger
  • Pivoted Document Length Normalization (1996)
  • Iterative methods for sparse linear systems
  • Matrices, Vector Spaces, and Information Retrieval
  • Singular value decomposition M.W. Berry, S.T. Dumais, G.W. O'Brien. Using Linear Algebra for Intelligent Information Retrieval (1995)
  • Scientific Computing: Fundamentals and Applications
  • Two Algorithms for Nearest-Neighbor Search in High Dimensions (1997)

    Semantic Web

  • Latent Semantic Indexing, An interesting discussion
  • CIRCA whitepaper

    Parallel Computing

  • Parallel computation of the singular value decomposition
  • MapReduce: Simplified Data Processing on Large Clusters

    Search Engine

  • HillTop Ranking
  • Yahoo! Research Lab
  • Search Engine Watch
  • Larbin - A Multipurpose Crawler Shell
  • Papers written by Googlers
  • Google File System
  • Crawler links
  • Building a Vector Space Search Engine in Perl

    Data Mining & Machine Learning

  • Mining and Knowledge Discovery from the Web
  • Very Large Data Bases (VLDB) Conference
  • Journal of AI Research JAIR
  • SIGKDD Explorations
  • Data Mining at UTCS
  • Inderjit S. Dhillon. Publications by Mohammed Javeed Zaki
  • Research at Microsoft
  • IBM Almaden Research Center
  • A Roadmap to Text Mining and Web Mining
  • Stanford Publication Server
  • Clustering Large Dataset
  • gSpan, Source Code
  • How to Implement SVMs - by J. Platt, IEEE Intelligent Systems Magazine, Trends and Controversies, Marti Hearst, ed., vol 13, no 4, (1998).
  • AI-SPECIFIC Software RESOURCES - Collected By Evangelos E. Milios


  • Tom Mitchell
  • Jiawei Han
  • Vipin Kumar
  • David Skillicorn
  • Information Theory, Inference, and Learning Algorithms

  • Brainstorming, Influence, and Icebergs
  • Accidental Algorithms

    Last Update: September 13, 2004 11:12 AM by Zheyuan Yu