DOI: 10.1145/2232817.2232823
research-article

Digital preservation and knowledge discovery based on documents from an international health science program

Published: 10 June 2012

ABSTRACT

Important biomedical information is often recorded, published or archived in unstructured and semi-structured textual form. Artificial intelligence and knowledge discovery techniques may be applied to large volumes of such data to identify and extract useful metadata, not only for providing access to these documents, but also for conducting analyses and uncovering patterns and trends in a field. The System for Preservation of Electronic Resources (SPER), an information management tool developed at the U.S. National Library of Medicine, provides these capabilities by integrating machine learning, data mining and digital preservation techniques. In this paper, we present an overview of SPER and its ability to retrieve information from one such dataset. We show how SPER was applied to the semi-structured records of an international health science program, the 46-year continuous archive of conference publications and related documents from the Joint Cholera Panel of the U.S.-Japan Cooperative Medical Science Program (CMSP). We explain the techniques by which metadata was extracted automatically from the semi-structured document contents to preserve these publications, and show how such data was used to quantitatively describe the activity of a research community toward a preliminary study of a subset of its specific health science program goals.

      • Published in

        JCDL '12: Proceedings of the 12th ACM/IEEE-CS joint conference on Digital Libraries
        June 2012
        458 pages
        ISBN: 9781450311540
        DOI: 10.1145/2232817

        Copyright © 2012 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        Overall Acceptance Rate: 415 of 1,482 submissions, 28%