Term-based Knowledge Organization and Summarization
The focus of this project is the unstructured knowledge embedded in large document collections.
Examples of knowledge-rich document collections include the research literature of a field and
the email correspondence of company employees on the corporate intranet.
The terminology associated with a document collection provides the vocabulary in which its
knowledge content can be expressed.
On the basis of the extracted terminology, the document collection can be clustered hierarchically.
The document hierarchy in turn induces a hierarchical organization on the terminology, leading to
a lexical ontology for the domain.
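
As a rough illustration of term-based hierarchical clustering, the sketch below merges the two most similar clusters until one tree remains. It is only a minimal example under stated assumptions: whitespace tokens stand in for extracted terms, similarity is the cosine of summed term-count vectors, and the names `term_vector`, `cosine`, and `cluster` are hypothetical, not part of the project.

```python
import math
from collections import Counter

def term_vector(text):
    """Term-count vector; whitespace tokens stand in for extracted terms (assumption)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cluster(docs):
    """Agglomerative clustering; returns a nested tuple of document indices."""
    # Each cluster is a pair: (tree of document indices, summed term vector).
    clusters = [(i, term_vector(d)) for i, d in enumerate(docs)]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = cosine(clusters[i][1], clusters[j][1])
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        # Merge the most similar pair; the merged vector is the sum of both.
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]

docs = [
    "gene protein expression",
    "protein gene sequence",
    "stock market price",
    "market price index",
]
tree = cluster(docs)
print(tree)
```

The pairwise search is quadratic per merge, so this form only suits small examples; the summed vector at each node could also be inspected for its most frequent terms to label that node of the hierarchy.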
Terms that appear frequently in a subset of the document collection capture the knowledge content
of that subset.
They also guide the extraction of key sentences from its documents; together with the frequent
terms, these sentences form a summary of the subset.
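
One simple way to picture this kind of summary is sketched below. It is a hedged example, not the project's method: single words stand in for terms, the stopword list is a small illustrative assumption, sentences are scored by how many frequent terms they contain, and `frequent_terms` and `key_sentences` are hypothetical names.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption); a real system would use a fuller one.
STOPWORDS = frozenset({"the", "a", "of", "and", "is", "in", "to"})

def frequent_terms(docs, k=3):
    """Return the k most frequent non-stopword words across the documents."""
    counts = Counter()
    for d in docs:
        counts.update(w for w in re.findall(r"[a-z]+", d.lower()) if w not in STOPWORDS)
    return [t for t, _ in counts.most_common(k)]

def key_sentences(docs, terms, n=2):
    """Return up to n sentences containing the most frequent-term occurrences."""
    term_set = set(terms)
    scored = []
    for d in docs:
        # Naive sentence split on ., ! or ? followed by whitespace.
        for s in re.split(r"(?<=[.!?])\s+", d.strip()):
            score = sum(1 for w in re.findall(r"[a-z]+", s.lower()) if w in term_set)
            if score:
                scored.append((score, s))
    scored.sort(key=lambda pair: -pair[0])  # stable: ties keep document order
    return [s for _, s in scored[:n]]

docs = [
    "Protein folding is hard. The weather is nice.",
    "Protein structure guides folding. Folding of a protein takes time.",
]
terms = frequent_terms(docs, k=2)
summary = key_sentences(docs, terms, n=3)
print(terms)
print(summary)
```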
If the document collection forms a networked information space through hyperlinks or
references and citations, terminology extracted from the text can complement the link
information for knowledge organization.
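
The text does not say how term and link information are combined. Purely as an assumed illustration, the sketch below blends a term-based cosine similarity with the Jaccard overlap of two documents' link or citation sets; the weight `alpha` and the names `jaccard` and `combined_similarity` are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def jaccard(x, y):
    """Overlap of two link/citation sets: |x & y| / |x | y|."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0

def combined_similarity(terms_a, terms_b, links_a, links_b, alpha=0.5):
    """Weighted blend of term similarity and link similarity (alpha is an assumed knob)."""
    text_sim = cosine(Counter(terms_a), Counter(terms_b))
    link_sim = jaccard(set(links_a), set(links_b))
    return alpha * text_sim + (1 - alpha) * link_sim

# Two documents with no shared terms but overlapping citations.
score = combined_similarity(["ontology"], ["clustering"], {"ref1", "ref2"}, {"ref2", "ref3"}, alpha=0.0)
print(score)
```

With `alpha=1.0` only the terms count, and with `alpha=0.0` only the links count, so sparsely linked documents can still be related through shared terminology.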
Infrastructure projects:
- automatic term extraction from large (gigabyte-scale) document collections
- named entity extraction from large (gigabyte-scale) document collections
- inverted indexing
- term-based summarization of a document collection
- hierarchical clustering of a document collection based on a term-document vector representation
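
To make the inverted indexing item concrete, here is a minimal in-memory sketch. It assumes lowercase alphanumeric tokens as terms and uses the hypothetical names `build_inverted_index` and `search`; a gigabyte-scale collection would need on-disk, compressed postings rather than Python sets.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase alphanumeric tokens; a stand-in for real term extraction (assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query term (boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

docs = [
    "term extraction from text",
    "named entity extraction",
    "inverted indexing of terms",
]
index = build_inverted_index(docs)
print(search(index, "extraction"))
print(search(index, "named extraction"))
```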