DOI: 10.1145/1065385.1065457
Article

Resolving the unencoded character problem for Chinese digital libraries

Published: 07 June 2005

ABSTRACT

Constructing a Chinese digital library, especially one that archives historical articles, is often hampered by the small character sets supported by current computer systems. This paper aims to resolve the unencoded character problem for Chinese digital libraries with a practical, composite approach. The proposed approach consists of a glyph expression model, a glyph structure database, and supporting tools. It resolves the following problems. First, the extensibility of Chinese characters is preserved. Second, unencoded characters become as easy to generate, input, display, and search as existing ones. Third, the approach is compatible with the existing encoding schemes that most computers use.

This approach has been used by organizations and projects in various application domains, including archeology, linguistics, ancient texts, calligraphy and paintings, and stone and bronze rubbings. For example, at Academia Sinica, a very large full-text database of ancient texts called Scripta Sinica has been created using this approach. The Union Catalog of the National Digital Archives Project (NDAP) dealt with the unencoded characters encountered when merging the metadata of 12 different thematic domains from various organizations. Also, in the Bronze Inscriptions Research Team (BIRT) of Academia Sinica, 3,459 bronze inscriptions were added, which is very helpful to education and research in historical linguistics.
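The abstract does not give the notation of the glyph expression model or the schema of the glyph structure database, so the following is only a minimal sketch of the general idea: describe a character by its structure instead of a code point, then index it by component so it can be searched. As a stand-in for the paper's model it uses Unicode Ideographic Description Characters (U+2FF0..U+2FFB). The names `GlyphEntry`, `GlyphStructureDB`, `components`, and the `demo-00x` keys are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass

# Assumption: Unicode Ideographic Description Characters (U+2FF0..U+2FFB)
# stand in for the paper's own glyph expression operators.
IDC_FIRST, IDC_LAST = "\u2ff0", "\u2ffb"


def components(expression: str) -> set[str]:
    """Return the component characters of an expression, ignoring layout operators."""
    return {ch for ch in expression if not (IDC_FIRST <= ch <= IDC_LAST)}


@dataclass(frozen=True)
class GlyphEntry:
    key: str         # hypothetical local identifier, not a code point
    expression: str  # structural description, e.g. "\u2ff0女子"


class GlyphStructureDB:
    """Toy in-memory index from a component to the entries that contain it."""

    def __init__(self) -> None:
        self._entries: dict[str, GlyphEntry] = {}
        self._by_component: dict[str, set[str]] = {}

    def add(self, key: str, expression: str) -> None:
        # Store the entry and register it under each of its components.
        self._entries[key] = GlyphEntry(key, expression)
        for part in components(expression):
            self._by_component.setdefault(part, set()).add(key)

    def expression(self, key: str) -> str:
        return self._entries[key].expression

    def find_by_component(self, part: str) -> list[str]:
        # Sorted so that results are deterministic.
        return sorted(self._by_component.get(part, set()))


db = GlyphStructureDB()
db.add("demo-001", "\u2ff0女子")  # left-to-right: 女 + 子
db.add("demo-002", "\u2ff0日月")  # left-to-right: 日 + 月
db.add("demo-003", "\u2ff1日月")  # above-to-below: 日 over 月
print(db.find_by_component("日"))  # ['demo-002', 'demo-003']
```

Because entries are keyed by structure and not by code point, a character missing from the encoding can be stored and found in the same way as an encoded one, which is the property the abstract claims for the approach.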


          • Published in

            JCDL '05: Proceedings of the 5th ACM/IEEE-CS joint conference on Digital libraries
            June 2005
            450 pages
            ISBN:1581138768
            DOI:10.1145/1065385
            General Chair: Mary Marlino
            Program Chairs: Tamara Sumner, Frank Shipman

            Copyright © 2005 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Acceptance Rates

            Overall Acceptance Rate: 415 of 1,482 submissions, 28%
