NLP Research Links
- General Links
-
- Central Resources
- ACL Anthology - A Digital Archive of Research Papers in Computational Linguistics
- Associations
- Journals
- Institutions and Groups
- Conference links
- E-mail lists
- NLP Courses
- General Human Language Resources
- Other General Resources
- Speech processing
-
- CMU Sphinx - The CMU Sphinx Group Open Source Speech Recognition Engines
- eSpeak - text to speech open source software
- The Festival Speech Synthesis System - by the Centre for Speech Technology Research, Univ. of Edinburgh
- Festival at CMU
- PRAAT: Phonetics and Speech Tools
- The EMU Speech Database System
- Example of prosodic annotation data
- ToBI Annotation document
- Phonology
- Commercial
- N-gram analysis
-
- Text::Ngrams Perl Package - Flexible Ngram analysis (for characters, words, and more); on CPAN; by Vlado Keselj
- Ngram Statistics Package in Perl, by T. Pedersen et al.
- Text::Ngram Perl Package by Simon Cozens
- Perl script ngram.pl by Jarkko Hietaniemi
- Waterloo Statistical N-Gram Language Modeling Toolkit - in C++ by Fuchun Peng
- Suffix Arrays for Ngrams
- Suffix Arrays - description and implementation by Douglas McIlroy, with implementation by Sean Quinlan and Sean Dorward
- Preprocessing
-
- Transcription and Encoding Schemes
- Sentence Splitters (sentencizers, sentence boundary detectors)
- Word segmentation
- Stop-word Removal
- Morphology
-
- Letter-to-Phoneme Pascal Challenge 2006
- Unsupervised segmentation of words into morphemes -- Challenge 2005
- Using eigenvectors of the bigram graph to infer morpheme identity - by Mikhail Belkin and John Goldsmith, 2002
- Stemmers and Lemmatizers
- Finite State Methods
-
- intex - Linguistic Development Environment by Max Silberztein
- UNITEX - Corpus processing system, GPL licence, implemented in C/C++ and Java; spawned from the same project as intex and NooJ
- Nooj - Linguistic development environment - Related to intex, by Max Silberztein.
- JFLAP - Java Formal Languages and Automata Package
- POS Tagging
-
- Software Plaza: Brill's tagger
- AI repository: Brill's tagger
- TnT -- Statistical Part-of-Speech Tagging
- QTag - a probabilistic POS tagger - language independent, implemented in Java, by Oliver Mason (c) 1994-2003
- Ingo's collection of POS taggers
- CLAWS part-of-speech tagger
- UCREL CLAWS7 Tagset
- AI repository: taggers
- Brazilian Portuguese POS Tagger (based on QTAG)
- Lingua::EN::Tagger - POS tagger for English; Perl module; uses HMM bigram word model; by Maciej Ceglowski and Aaron Coburn
- POS tagger in Perl/Tk by Kristie Seymore
- BNC POS tags
- A Practical Part-of-Speech Tagger (1992) - Cutting et al.; Xerox code, Application to Spanish
- SVMTool - Open source POS tagger based on Support Vector Machine
- Stanford Log-linear Tagger
- Document Clustering
-
- Tutorial on clustering Large and High-Dimensional data by Nicholas et al. - at CIKM 2003
- Clustering and Segmentation software on KDnuggets
- CLUTO - Software Package for Clustering High-Dimensional Datasets
- Matlab Clustering Package by Frank Dellaert
- Terminology Extraction
-
- Gensen Web - An automatic domain terminology extraction system
- Text Categorization (TC)
-
- Spam detection and E-mail classification
- Encoding identification
- Language identification
- Sentiment classification
- Authorship attribution and Plagiarism detection (AATT)
- Topic categorization
- Other
- Text Summarization
-
- Text Summarization site - by Dragomir Radev
- "Statistics-Based Summarization --- Step One: Sentence Compression," (K. Knight and D. Marcu), National Conference on Artificial Intelligence (AAAI), 2000.
- Dictionaries and Lexicons
-
- ACL-SIGLEX - ACL Special Interest Group on Lexicon
- Dictionary development
- On-line dictionaries
- Lexical Semantics
-
- WordNet
- Word Sense Disambiguation (WSD)
- Unification
-
- Theory
- "Unification: A Multidisciplinary Survey," Kevin Knight, ACM Computing Surveys, 21(1), pages 93-124, 1989.
- Practice: Unification-based Systems
- Grammar Formalisms
-
- Unification-based grammars
- Head-driven Phrase Structure Grammar (HPSG)
- Lexical Functional Grammar (LFG)
- Stochastic Unification-based Grammars
- "Stochastic Attribute-Value Grammars," Steven Abney, Computational Linguistics, 23(4), pages 597-617, 1997.
- Parsing (Syntactic Analysis)
-
- ALE - unification-based parser, coverage: medium
- LKB - unification-based parser, coverage: medium
- PC-PATR - unification-based parser, coverage: small
- Stefy - unification-based parser, coverage: small
- NLP Software (includes parser list)
- Parser comparison (several parsers referenced)
- Collins parser, coverage: large
- Link Grammar parser, coverage: large
- Apple Pie Parser
- Probabilistic Word Graph Parser: Java Source & Documentation, Bob Carpenter, coverage: small
- MINIPAR parser, coverage: medium
- Evalb - bracket scoring program
- Parse TreeBanks
-
- The Penn Treebank Project (English)
- NEGRA corpus (German)
- Kyoto Text Corpus (Japanese)
- Machine Translation
-
- General
- On-line translation
- MT Research
- Building and Using Parallel Texts: Data Driven Machine Translation and Beyond, HLT-NAACL 2003 Workshop, May 31, 2003
- "Fast Decoding and Optimal Decoding for Machine Translation" (U. Germann, M. Jahr, K. Knight, D. Marcu, and K. Yamada), Proc. of the Conference of the Association for Computational Linguistics (ACL), 2001.
- "Unification-Based Glossing," (V. Hatzivassiloglou and K. Knight), Proc. of the International Joint Conference on Artificial Intelligence (IJCAI), 1995.
- A list of Suppliers of Machine Translation Software - by The British Computer Society Natural Language Translation Specialist Group
- Systran - Commercial machine translation; free on-line service available
- Babel Fish Translation - On-line translation, Babel Fish, AltaVista (powered by SYSTRAN)
- Information Retrieval
-
- Open Source Search Engines
- Lucene - Jakarta Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is suitable for nearly any application that requires full-text search, especially cross-platform. Jakarta Lucene is an open source project available for free download from Apache Jakarta.
- Wumpus - by Stefan Buettcher
- WebSPHINX - A Personal, Customizable Web Crawler
- A list of IR systems (ir.dcs.gla.ac.uk)
- System SMART
- OKAPI
- The Lemur Toolkit for Language Modeling and Information Retrieval
- Nutch search engine
- Zettair (once called Lucy)
- mg ("Managing Gigabytes")
- DataparkSearch Engine
- Lemur
- Andrew McCallum's Code and Data
- Introduction to Information Retrieval - by Christopher Manning, Prabhakar Raghavan, and Hinrich Schütze, 2007, draft available on-line
- Information Retrieval - A book by C. J. van Rijsbergen, 1979, available on-line
- Modern Information Retrieval - A book by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, contents
- Cross-Language IR
- Information Extraction
-
- Balie - A tool for multilingual information extraction
- Chelba, Mahajan: Information Extraction Using the Structured Language Model
- BioCreAtIvE - Critical Assessment of Information Extraction systems in Biology
- Semantic Annotation
-
- Semantic Web
- Genomics
- Semantic Annotation: Other
- Semantic Role Labeling
- XML-Related
- Ontologies
-
- SWRC - Semantic Web Research Community Ontology
- Open Cyc
- Cycorp
- SUMO - Suggested Upper Merged Ontology - link suggested by Adam Pease [Link1]
- Protege Ontologies Library
- SUMO translation for Protege frame system
- SUO
- Dublin Core
- Question Answering
-
- TREC QA
- QA Systems
- FAQ Collections
- Natural Language Generation
-
- Chatterbots in Google Directory
- NLP Tools
-
- GATE - General Architecture for Text Engineering, used in Information Extraction
- MedLEE
- OpenNLP - Open source NLP, project umbrella
- Natural Language Toolkit (NLTK) in Python
- FreeLing
- The Festival Speech Synthesis System - by the Centre for Speech Technology Research, Univ. of Edinburgh
- Festival at CMU
- PRAAT: Phonetics and Speech Tools
- The EMU Speech Database System
- Commercial tools
- NL Corpora and Other NL Resources
-
- Standards
- Word Lists
- N-grams
- NL Corpora - Free
- NL Corpora - Free with Licence agreement
- NL Corpora - Commercial
- Commercial Links
-
- NLP Products
- NLP Companies
- Thanks for Links
-
- Pythonner
© 2003-2024 Vlado Keselj, last update: 13-Jan-2022