MALNIS Digital Library

Paperfind Search Engine
username: user, passwd: (by email)

Diagram

Tasks

0. Get and post Pimenas' presentation slides. Paperfind is installed at apollo:~pimenas/paperfind
Everyone uploads their collection to their apollo directory ~<home>/public_html/Readings

1. Write a script to collect papers from individual collections (uploaded to apollo) and recompile Lucene index.
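A minimal sketch of the collection half of task 1, assuming all home directories sit under one root and that papers are PDF files. The function name collect_papers, the per-user layout of the destination directory, and the "copy only if missing or newer" rule are illustrative assumptions. Rebuilding the Lucene index depends on how Paperfind is installed and is not shown.

```python
import shutil
from pathlib import Path


def collect_papers(home_root, dest):
    """Copy PDFs from <home_root>/<user>/public_html/Readings to <dest>/<user>/.

    Returns the list of files that were copied on this run.
    """
    home_root, dest = Path(home_root), Path(dest)
    copied = []
    for readings in sorted(home_root.glob("*/public_html/Readings")):
        user = readings.parent.parent.name
        for src in sorted(readings.rglob("*")):
            # Accept .pdf and .PDF; skip directories and other file types.
            if not src.is_file() or src.suffix.lower() != ".pdf":
                continue
            target = dest / user / src.relative_to(readings)
            # Skip files that were already collected and have not changed since.
            if target.exists() and src.stat().st_mtime <= target.stat().st_mtime:
                continue
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, target)  # copy2 keeps the modification time
            copied.append(target)
    return copied
```

A call such as collect_papers("/home", "/tmp/malnis-collection") (both paths are placeholders) could be run from cron, with the index rebuild triggered only when the returned list is non-empty.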

2. Convert pdf to txt (use pdfbox) [Check how pdf notes and ligatures get translated into text].
The pdfbox package contains an executable, ExtractText.exe, that does the conversion. Notes (annotations) in the pdf file do not appear in the txt output.
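On the ligature question: if the extracted text turns out to contain single-character ligatures (for example U+FB01 for "fi"), one possible post-processing step is Unicode NFKC normalization, which expands them to plain letters. This is a sketch of that clean-up only, not of what pdfbox itself does; the function name is an assumption. Note that NFKC also rewrites other compatibility characters (superscript digits, full-width letters), which may or may not be wanted.

```python
import unicodedata


def normalize_extracted_text(text):
    """Expand compatibility characters such as the 'fi' and 'ffi' ligatures."""
    return unicodedata.normalize("NFKC", text)


sample = "e\ufb03cient classi\ufb01cation"
print(normalize_extracted_text(sample))  # efficient classification
```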

3. Metadata extraction:
Question: Is it possible/easy to automatically retrieve from existing resources the metadata record corresponding to a given pdf file?
- Check Citeseer OAI records (API).
- Check DBLP (once an article is found, its bibtex entry is available. Abstracts are accessible through the EE link, in ACM, IEEE, Springer formats. DBLP contains a list of coauthors/coauthored papers, useful for constructing coauthorship networks.)
DBLP search by title returns nothing if the query includes a word that is not present in the title.
Task 1: Download DBLP XML records and Citeseer OAI records and index them as one document per paper (e.g. in Lucene).
Search the index using the first few words of the pdf->txt file as the query. If nothing is returned, retry with fewer words in the query.
If more than one record is returned, ask a human to pick the correct one.
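The search-with-fewer-words loop of Task 1 could be sketched as follows. A small in-memory inverted index over titles stands in for Lucene here, and the record ids, titles, function names, and the max_words/min_words defaults are all hypothetical. A query matches only records containing every query word, which mirrors the DBLP behaviour noted above.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(records):
    """Map each title word to the set of record ids whose title contains it."""
    index = defaultdict(set)
    for rec_id, rec in records.items():
        for word in tokenize(rec["title"]):
            index[word].add(rec_id)
    return index


def search_all_words(index, words):
    """Return the ids of records whose title contains every query word."""
    hits = None
    for word in words:
        ids = index.get(word, set())
        hits = set(ids) if hits is None else hits & ids
    return hits or set()


def match_record(index, text, max_words=8, min_words=2):
    """Query with the first max_words words, dropping the last word on each miss.

    Returns (number of words used, sorted candidate ids). More than one
    candidate means a human has to pick the correct record.
    """
    words = tokenize(text)[:max_words]
    for n in range(len(words), min_words - 1, -1):
        hits = search_all_words(index, words[:n])
        if hits:
            return n, sorted(hits)
    return 0, []


# Made-up records, for illustration only.
records = {
    "r1": {"title": "Clustering Scientific Papers with Keyphrases"},
    "r2": {"title": "Clustering Web Documents"},
}
index = build_index(records)
first_lines = "Clustering Scientific Papers with Keyphrases A. Author, B. Author Abstract"
print(match_record(index, first_lines))  # (5, ['r1'])
```

Dropping words from the end works because the pdf->txt output normally starts with the title, so the trailing query words are the ones most likely to be author names or abstract text.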
Task 2: Establish a rule that each group member manually adds metadata to the pdf file properties of each new paper he/she reads
(possible in Acrobat; there may be free programs that do the same).
Acrobat: File -> Document Properties -> Description
Title, author, keywords (put abstract here),
File -> Document Properties -> Custom
Insert properties (Name, Value) corresponding to bibtex properties.
To automatically extract the properties of a pdf file, a utility based on the Acrobat SDK may be available.
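Once the (Name, Value) properties have been read back out of a pdf file by whatever utility is chosen, turning them into a bibtex entry is straightforward. The sketch below assumes the properties arrive as a plain dictionary; the function name, citation key, and sample values are made up, and no escaping of special characters is attempted.

```python
def to_bibtex(cite_key, entry_type, props):
    """Render name/value pairs as a single BibTeX entry (field names lower-cased)."""
    lines = ["@%s{%s," % (entry_type, cite_key)]
    for name, value in props.items():
        lines.append("  %s = {%s}," % (name.lower(), value))
    lines.append("}")
    return "\n".join(lines)


# Hypothetical custom properties as they might be entered in Acrobat.
props = {"Author": "J. Doe", "Title": "A Title", "Year": "2007"}
print(to_bibtex("doe2007", "article", props))
```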

Look into the Daffodil project. They apparently already have metadata extraction facilities from more than one source (DBLP, Citeseer, plus others).
Check out the Daffodil Wiki.

- Brief update on new Citeseer project.

- Define a follow-on project for a UG student. Task 1 above is suitable.

4. Term extraction for doc and cluster summarization using KEA or KEA++. (See also Olena Medelyan's home page)
Use save-as-text feature of Acrobat or the ExtractText.exe program of pdfbox to convert a few documents by hand until task #2 is done.

5. Manual cleaning of terms extracted from step #4

6. Construction of term-document matrix, clustering using the Bow toolkit.
Try LDA and compare results with standard method.
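For reference, the term-document matrix of step 6 is just a table of term counts per document. The sketch below builds one from cleaned term lists in plain Python to show the data structure; it does not reproduce the input format or the clustering of the Bow toolkit, and the function name and sample documents are assumptions.

```python
from collections import Counter


def term_document_matrix(docs):
    """Build a term-document count matrix.

    docs maps a document id to its list of extracted terms. Returns
    (terms, doc_ids, matrix) where matrix[i][j] is the number of times
    terms[i] occurs in document doc_ids[j].
    """
    doc_ids = sorted(docs)
    counts = {doc_id: Counter(docs[doc_id]) for doc_id in doc_ids}
    terms = sorted({term for c in counts.values() for term in c})
    matrix = [[counts[doc_id][term] for doc_id in doc_ids] for term in terms]
    return terms, doc_ids, matrix


# Made-up term lists, e.g. the cleaned output of steps 4 and 5.
docs = {
    "d1": ["lucene", "index", "index"],
    "d2": ["index", "cluster"],
}
terms, doc_ids, matrix = term_document_matrix(docs)
for term, row in zip(terms, matrix):
    print(term, row)
```

Rows are terms and columns are documents; transposing the matrix gives the document vectors that a clustering method would compare.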

7. Design a simple front end for viewing clustering results

Look into Google Books for ideas.

Longer-term projects (not in the diagram)
0. Distributed clustering
1. Peer-to-peer document sharing (Hathai)

Relevant Links

Digital Libraries
Citeseer -- Next Generation Citeseer (CiteseerX)
Google Scholar
Rexa
- FAQ Refs
DBL Browser
Beagle++ : Toolbox: Towards an Extendable Desktop Search Architecture (see here)

Personal Research literature organizing / sharing
Citeulike
mekentosj.com (facilities for organizing pdf libraries of scientific articles online)

Semantic Desktop
SemDesk 2006 Semantic Desktop and Social Semantic Collaboration
Nepomuk: Social Semantic Desktop

Text Mining Software
KEA
Judge
GATE

Infrastructure software
DSpace
Greenstone

MediaWiki (to support collaboration on this project)

Social exchange Web services
Sharing bookmarks: http://del.icio.us/
Sharing references: http://www.citeulike.org/
Sharing bookmarks and references: http://www.bibsonomy.org/

Metadata
Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)
Dublin Core Metadata Initiative