A metadata generation system for scanned scientific volumes
Source
International Conference on Digital Libraries
Proceedings of the 8th ACM/IEEE-CS joint conference on Digital libraries
Pittsburgh, PA, USA
SESSION: It's the metadata, stupid
Pages 167-176
Year of Publication: 2008
ISBN: 978-1-59593-998-2
Authors
Xiaonan Lu, The Pennsylvania State University, University Park, PA, USA
Brewster Kahle, Internet Archive, San Francisco, CA, USA
James Z. Wang, The Pennsylvania State University, University Park, PA, USA
C. Lee Giles, The Pennsylvania State University, University Park, PA, USA
Sponsors
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
SIGIR: ACM Special Interest Group on Information Retrieval
ACM: Association for Computing Machinery
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1378889.1378918

ABSTRACT

Digital libraries have conducted large-scale digitization projects to preserve cultural artifacts and to provide permanent access to them. The growing volume of digitized resources, including scanned books and scientific publications, calls for tools and methods that can efficiently analyze and manage large collections. In this work, we tackle the problem of extracting metadata from scanned journal volumes. Our goal is to extract information describing the internal structure and content of scanned volumes, which is necessary for giving digital library users effective access to that content. We propose methods for automatically generating volume-level, issue-level, and article-level metadata based on format and text features extracted from OCRed text. We report the performance of our system on scanned, bound historical documents nearly two centuries old. We have developed the system and integrated it into an operational digital library, the Internet Archive, for real-world use.

