ABSTRACT
Important biomedical information is often recorded, published or archived in unstructured and semi-structured textual form. Artificial intelligence and knowledge discovery techniques may be applied to large volumes of such data to identify and extract useful metadata, not only to provide access to these documents, but also to support analyses that uncover patterns and trends in a field. The System for Preservation of Electronic Resources (SPER), an information management tool developed at the U.S. National Library of Medicine, provides these capabilities by integrating machine learning, data mining and digital preservation techniques. In this paper, we present an overview of SPER and its ability to retrieve information from one such dataset. We show how SPER was applied to the semi-structured records of an international health science program: the 46-year continuous archive of conference publications and related documents from the Joint Cholera Panel of the U.S.-Japan Cooperative Medical Science Program (CMSP). We explain the techniques by which metadata was extracted automatically from the semi-structured document contents to preserve these publications, and show how this metadata was used to describe quantitatively the activity of a research community, as a step toward a preliminary study of a subset of the program's specific health science goals.
Index Terms
- Digital preservation and knowledge discovery based on documents from an international health science program