ABSTRACT
Constructing a Chinese digital library, especially one for archiving historical documents, is often hampered by the small character sets that current computer systems support. This paper aims to resolve the unencoded character problem for Chinese digital libraries with a practical, composite approach. The proposed approach consists of a glyph expression model, a glyph structure database, and supporting tools. It addresses three problems. First, the extensibility of Chinese characters is preserved. Second, unencoded characters become as easy to generate, input, display, and search as encoded ones. Third, the approach is compatible with the existing encoding schemes that most computers use.
This approach has been used by organizations and projects in various application domains, including archeology, linguistics, ancient texts, calligraphy and paintings, and stone and bronze rubbings. For example, at Academia Sinica, a very large full-text database of ancient texts called Scripta Sinica has been created using this approach. The Union Catalog of the National Digital Archives Project (NDAP) dealt with the unencoded characters encountered when merging the metadata of 12 thematic domains from various organizations. In addition, in the Bronze Inscriptions Research Team (BIRT) of Academia Sinica, 3,459 bronze inscriptions were added, which greatly benefits education and research in historical linguistics.
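The abstract does not give the notation of the glyph expression model, so the following is only a rough sketch of the general idea of describing a character by its components and searching on them. It assumes a prefix notation built from two Unicode Ideographic Description Characters (U+2FF0 and U+2FF1) rather than the paper's own operators. The table `GLYPH_DB`, the functions `parse` and `components`, and the sample expression `unencoded` are all hypothetical names and data made up for illustration.

```python
# Illustrative sketch only. The operators are Unicode Ideographic Description
# Characters, used here as a stand-in for the paper's glyph expression notation.
LEFT_RIGHT = "\u2ff0"  # two parts side by side
TOP_BOTTOM = "\u2ff1"  # two parts stacked vertically
BINARY_OPS = {LEFT_RIGHT, TOP_BOTTOM}

# Hypothetical glyph structure database: encoded character -> glyph expression.
GLYPH_DB = {
    "好": LEFT_RIGHT + "女子",
    "明": LEFT_RIGHT + "日月",
    "男": TOP_BOTTOM + "田力",
}


def parse(expr, pos=0):
    """Parse a prefix expression starting at pos; return (tree, next_pos)."""
    ch = expr[pos]
    if ch in BINARY_OPS:
        first, pos = parse(expr, pos + 1)
        second, pos = parse(expr, pos)
        return (ch, first, second), pos
    return ch, pos + 1


def components(tree, db):
    """Collect every component in a parsed expression, expanding via db."""
    if isinstance(tree, tuple):
        _, first, second = tree
        return components(first, db) | components(second, db)
    found = {tree}
    if tree in db:
        # An encoded character may itself decompose further.
        found |= components(parse(db[tree])[0], db)
    return found


# A made-up composition standing in for a character with no code point.
unencoded = LEFT_RIGHT + "女明"
parts = components(parse(unencoded)[0], GLYPH_DB)
print(sorted(parts))
```

Because the unencoded character is stored as an expression over encoded components, a search for any component (here 月, reached through 明) can match it without a new code point being assigned.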
Index Terms
Resolving the Unencoded Character Problem for Chinese Digital Libraries