Term-based Knowledge Organization and Summarization

The focus of this project is the unstructured knowledge embedded in large document collections.
Examples of large document collections rich in knowledge content include the research literature and
the email correspondence of company employees on a corporate intranet. The terminology
associated with a document collection is the language in which the knowledge content can be
expressed. On the basis of the extracted terminology, the document collection can be clustered hierarchically.
The document hierarchy induces a hierarchical organization on the terminology, leading to a lexical
ontology for the domain.
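
As an illustration of how a document hierarchy can induce a term hierarchy, here is a minimal sketch
in Python. It is not the project's implementation. It assumes each document is already reduced to a
list of extracted terms, merges the two most similar clusters at each step (cosine similarity on
summed term counts), and labels every node with its most frequent terms. The names
`cluster_documents` and `top_terms` are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_documents(doc_terms):
    """Agglomerative clustering of {doc_id: [terms]} into a binary tree.

    Each node is a dict with the documents it covers, the summed term
    counts of those documents, and its child nodes (empty for leaves).
    """
    if not doc_terms:
        raise ValueError("need at least one document")
    clusters = [
        {"docs": [doc_id], "vector": Counter(terms), "children": []}
        for doc_id, terms in doc_terms.items()
    ]
    while len(clusters) > 1:
        # Find the most similar pair of current clusters.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine(clusters[i]["vector"], clusters[j]["vector"])
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        a, b = clusters[i], clusters[j]
        merged = {
            "docs": a["docs"] + b["docs"],
            "vector": a["vector"] + b["vector"],
            "children": [a, b],
        }
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0]


def top_terms(node, n=2):
    """Most frequent terms under a node; these label that level of the hierarchy."""
    return [term for term, _ in node["vector"].most_common(n)]
```

The pairwise search is quadratic in the number of clusters at every step, so this sketch only suits
small examples; a gigabyte-scale collection would need a more scalable method.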
Terms that appear frequently in a subset of the document collection capture the knowledge content of that subset.
They can also be used to extract key sentences from the documents; together with the frequent terms, these
sentences form a summary of the subset.
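
The idea can be sketched as follows. This is a minimal illustration under simplifying assumptions,
not the project's method: single words stand in for extracted terms, the stopword list is a tiny
hypothetical one, and a sentence is scored by how many frequent terms it contains. The function
names are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real system would use a fuller one).
STOPWORDS = {"a", "an", "and", "the", "is", "are", "of", "in", "on", "to", "for"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def frequent_terms(documents, top_n=5):
    """Return the top_n most frequent non-stopword tokens in the subset."""
    counts = Counter()
    for doc in documents:
        counts.update(t for t in tokenize(doc) if t not in STOPWORDS)
    return [term for term, _ in counts.most_common(top_n)]


def key_sentences(documents, terms, top_k=2):
    """Return up to top_k sentences containing the most frequent-term occurrences."""
    term_set = set(terms)
    scored = []
    for doc in documents:
        for sentence in re.split(r"(?<=[.!?])\s+", doc.strip()):
            if not sentence:
                continue
            score = sum(1 for t in tokenize(sentence) if t in term_set)
            scored.append((score, sentence))
    # Stable sort: sentences with equal scores keep their document order.
    scored.sort(key=lambda pair: -pair[0])
    return [sentence for score, sentence in scored[:top_k] if score > 0]


def summarize(documents, top_n=5, top_k=2):
    """Summary of a document subset: its frequent terms plus key sentences."""
    terms = frequent_terms(documents, top_n)
    return terms, key_sentences(documents, terms, top_k)
```

Calling `summarize(subset)` on a list of document strings returns the frequent terms and the
selected sentences, which together make up the summary.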
If the document collection forms a networked information space through hyperlinks or references/citations, terminology extracted
from the text can be used to complement link information for knowledge organization.

Infrastructure projects:
- automatic term extraction from large (gigabyte-scale) document collections
- named entity extraction from large (gigabyte-scale) document collections
- inverted indexing
- term-based summarization of a document collection
- hierarchical clustering of a document collection based on a term-document vector representation
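
To illustrate the inverted indexing item, here is a minimal in-memory sketch in Python. It is an
assumption-laden toy rather than the project's infrastructure: tokens are plain lowercase
alphanumeric words, the index maps each token to the set of documents containing it, and a query
returns the documents that contain all query tokens. The names `build_inverted_index` and `search`
are hypothetical.

```python
import re
from collections import defaultdict


def build_inverted_index(documents):
    """Map each token to the set of ids of the documents that contain it.

    documents: dict of {doc_id: text}.
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every token of the query."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    if not tokens:
        return set()
    postings = [index.get(token, set()) for token in tokens]
    return set.intersection(*postings)
```

An index over gigabyte-scale collections would normally be built on disk with compressed posting
lists; the in-memory dictionary here only shows the data structure.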