MALNIS Digital Library
Paperfind Search Engine
username: user, passwd: (by email)
Tasks
0. Get and post Pimenas presentation slides. Paperfind installation on apollo:~pimenas/paperfind
Everyone uploads their collection to their apollo directory ~<home>/public_html/Readings
1. Write a script to collect papers from individual collections (uploaded to
apollo) and recompile Lucene index.
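A minimal sketch of the collection half of this task, assuming each member's home directory sits under a common root and that papers are stored as .pdf files below public_html/Readings. The function name collect_papers, the destination layout, and the duplicate check by SHA-256 digest are illustrative assumptions, not part of the Paperfind installation. Rebuilding the Lucene index would be a separate step run afterwards and is not shown.

```python
import hashlib
import shutil
from pathlib import Path


def collect_papers(home_root, dest, subdir="public_html/Readings"):
    """Copy every PDF found under <home_root>/<user>/<subdir> into dest.

    Files with identical content (same SHA-256 digest) are copied only once.
    Returns the list of newly copied destination paths.
    """
    home_root, dest = Path(home_root), Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    # Digests of files already collected on earlier runs.
    seen = {hashlib.sha256(p.read_bytes()).hexdigest() for p in dest.glob("*.pdf")}
    copied = []
    # Note: the pattern is case-sensitive on most systems, so "*.PDF" is skipped.
    for pdf in sorted(home_root.glob(f"*/{subdir}/**/*.pdf")):
        digest = hashlib.sha256(pdf.read_bytes()).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        # Prefix with the owner's name and a short digest to avoid name clashes.
        owner = pdf.relative_to(home_root).parts[0]
        target = dest / f"{owner}_{digest[:8]}_{pdf.name}"
        shutil.copy2(pdf, target)
        copied.append(target)
    return copied
```

Because already-collected digests are read back from the destination, the script can be rerun (for example from cron) and only new papers are copied.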
2. Convert pdf to txt (use pdfbox). [Check how pdf notes and ligatures
are translated into text.]
The pdfbox package contains an executable file ExtractText.exe that
does the trick. Notes in the pdf file do not appear in the txt output.
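For the ligature check, one possible post-processing step on the txt output is sketched here. It assumes the extractor leaves ligatures as single Unicode characters (for example U+FB01 for "fi"); whether pdfbox actually does so for a given pdf needs to be verified on real files. The function name normalize_extracted_text is hypothetical.

```python
import re
import unicodedata


def normalize_extracted_text(text):
    """Expand typographic ligatures and rejoin words hyphenated at line ends."""
    # NFKC maps compatibility characters such as U+FB01 to "fi" and U+FB03
    # to "ffi". It also rewrites other compatibility forms (superscript
    # digits, full-width letters), which may or may not be wanted.
    text = unicodedata.normalize("NFKC", text)
    # Rejoin "classifi-\ncation" into "classification". This also merges
    # genuine hyphenated compounds that happen to break at a line end.
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)
```

Running this before indexing keeps a query for "classification" from missing a document whose text contains the ligature form.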
3. Metadata extraction:
Question: Is it possible/easy to automatically retrieve from
existing resources the metadata record corresponding to a given pdf file?
- Check Citeseer OAI records (API).
- Check DBLP (once article is found,
bibtex entry is available. Abstracts accessible through EE link, ACM, IEEE,
Springer formats. DBLP contains list of coauthors/coauthored papers. Useful
for constructing coauthorship networks.)
A DBLP search by title returns nothing if the query includes a word that
is not present in the title.
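As an illustration of the coauthorship idea, the sketch below counts author pairs from DBLP-style XML, where each publication element carries repeated author children. The SAMPLE records are invented for the example. The real dblp.xml relies on character entities declared in its DTD, which xml.etree does not resolve by itself, so the full dump would need extra handling that is not shown here.

```python
import itertools
import xml.etree.ElementTree as ET
from collections import Counter

# Invented records in the general shape of DBLP XML (not real entries).
SAMPLE = """<dblp>
<article key="x/1"><author>A. One</author><author>B. Two</author><title>T1</title></article>
<inproceedings key="x/2"><author>B. Two</author><author>A. One</author><author>C. Three</author><title>T2</title></inproceedings>
</dblp>"""


def coauthor_pairs(dblp_xml):
    """Count how often each pair of authors appears on the same record."""
    pairs = Counter()
    for record in ET.fromstring(dblp_xml):
        # Sort names so that each pair has one canonical ordering.
        authors = sorted({a.text for a in record.findall("author") if a.text})
        pairs.update(itertools.combinations(authors, 2))
    return pairs
```

The resulting counter can be read as a weighted edge list: each key is an edge between two authors, each value the number of joint papers.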
Task 1: Download DBLP XML records and Citeseer OAI records and index them
as one document per paper (e.g. in Lucene).
Search the index using the first few words of the pdf->txt file as the query.
If nothing returned, use fewer words in the query.
If more than one answer, ask a human to pick the correct record.
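The back-off procedure in Task 1 can be sketched as follows. A plain list of records with an all-words-must-match title search stands in for the Lucene index; the names tokenize, search and match_record, and the max_words and min_words defaults, are assumptions made for the example.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def search(records, words):
    """Return records whose title contains every query word (AND semantics)."""
    wanted = set(words)
    return [r for r in records if wanted <= set(tokenize(r["title"]))]


def match_record(records, extracted_text, max_words=8, min_words=2):
    """Query with the first max_words words of the text, then back off.

    Returns (hits, n_words_used). More than one hit means a human has to
    pick the correct record; no hit at all returns ([], 0).
    """
    words = tokenize(extracted_text)[:max_words]
    for n in range(len(words), min_words - 1, -1):
        hits = search(records, words[:n])
        if hits:
            return hits, n
    return [], 0
```

Starting from the longest query and shortening it suits the case where the pdf text begins with the title followed by author names: the extra words are dropped one by one until only title words remain.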
Task 2: Establish rule that each group member should manually
add metadata into the pdf file properties for each new paper he/she reads
(possible in Acrobat, there may be free programs for doing the same).
Acrobat: File -> Document Properties -> Description
Title, author, keywords (put abstract here),
File -> Document Properties -> Custom
Insert properties (Name, Value) corresponding to bibtex properties.
To automatically extract the properties of a pdf file, a utility based on the
Acrobat SDK may be available.
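Once the (Name, Value) properties have been read out of a pdf by whatever utility is chosen, turning them into a BibTeX entry is straightforward. The sketch below assumes the properties arrive as a plain dictionary; the function name properties_to_bibtex and the citation key are hypothetical, and no escaping of special BibTeX characters is attempted.

```python
def properties_to_bibtex(key, entry_type, props):
    """Render a dict of (Name, Value) document properties as a BibTeX entry."""
    lines = [f"@{entry_type}{{{key},"]
    for name in sorted(props):
        value = props[name].strip()
        if value:  # skip properties that were left empty
            lines.append(f"  {name.lower()} = {{{value}}},")
    lines.append("}")
    return "\n".join(lines)
```

For example, properties_to_bibtex("blei2003", "article", {"Title": "Latent Dirichlet Allocation", "Year": "2003"}) yields an @article entry with title and year fields.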
Look into the Daffodil project.
They apparently already have metadata extraction facilities from more than one
source (DBLP, Citeseer, plus others).
Check out the Daffodil Wiki.
- Brief update on new Citeseer project.
- Define a follow-on project for a UG student. Task 1 above is suitable.
4. Term extraction for doc and cluster summarization using KEA or KEA++.
(See also Olena Medelyan's home page.)
Use save-as-text feature of Acrobat or the ExtractText.exe program
of pdfbox to convert a few documents by hand until task #2 is done.
5. Manual cleaning of terms extracted from step #4
6. Construction of term-document matrix, clustering using the Bow
toolkit.
Try LDA and compare results with standard method.
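To make the input to the clustering step concrete, here is a small pure-Python sketch of a term-document count matrix and a cosine similarity between two document columns. It does not use the Bow toolkit; the function names and the raw-count weighting are assumptions for illustration, and a real run would likely apply tf-idf or similar weighting.

```python
import math
from collections import Counter


def term_document_matrix(docs):
    """docs maps a document id to its list of (cleaned) terms.

    Returns (terms, doc_ids, matrix), where matrix[i][j] is the number of
    times terms[i] occurs in doc_ids[j].
    """
    doc_ids = sorted(docs)
    counts = {d: Counter(docs[d]) for d in doc_ids}
    terms = sorted(set().union(*(counts[d] for d in doc_ids)))
    matrix = [[counts[d][t] for d in doc_ids] for t in terms]
    return terms, doc_ids, matrix


def cosine(matrix, j, k):
    """Cosine similarity between document columns j and k."""
    a = [row[j] for row in matrix]
    b = [row[k] for row in matrix]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

Documents that share no extracted terms get similarity 0.0, so the quality of the manual term cleaning in step 5 directly shapes which documents can end up in the same cluster.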
7. Design a simple front end for viewing clustering results.
Look into Google Books for ideas.
Longer-term projects (not in the diagram)
0. Distributed clustering
1. Peer-to-peer document sharing (Hathai)
Relevant Links
Digital Libraries
Citeseer -- Next Generation Citeseer (CiteseerX)
Google Scholar
Rexa - FAQ
Refs
DBL Browser
Beagle++ : Toolbox: Towards an Extendable
Desktop Search Architecture (see here)
Personal Research literature organizing / sharing
Citeulike
mekentosj.com (facilities for organizing
pdf libraries of scientific articles online)
Semantic Desktop
SemDesk
2006 Semantic Desktop and Social Semantic Collaboration
Nepomuk: Social Semantic Desktop
Text Mining Software
KEA
Judge
GATE
Infrastructure software
DSpace
Greenstone
MediaWiki (to support collaboration on this project)
Social exchange Web services
Sharing bookmarks: http://del.icio.us/
Sharing references: http://www.citeulike.org/
Sharing bookmarks and references: http://www.bibsonomy.org/
Metadata
Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)
Dublin Core Metadata Initiative