Article, DOI: 10.1145/1135777.1135898

WebKhoj: Indian language IR from multiple character encodings

Published: 23 May 2006

Abstract

Today, web search engines provide the easiest way to reach information on the web. Yet more than 95% of Indian language content on the web is not searchable because web pages use multiple character encodings. Most of these encodings are proprietary and hence need some kind of standardization to make the content accessible via a search engine. In this paper we present a search engine called WebKhoj, which is capable of searching multi-script and multi-encoded Indian language content on the web. We describe a language-focused crawler and the transcoding processes involved in making Indian language content accessible. Finally, we report some of the experiments that were conducted, along with results on Indian language web content.
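
As a rough illustration of what transcoding a proprietary encoding into a standard one can look like, here is a minimal table-driven sketch in Python. It is not WebKhoj's actual method: the byte values in HYPOTHETICAL_FONT_MAP and the function names transcode and to_utf8 are invented for this example, and a real font encoding may also need multi-byte glyph sequences and character reordering, which this sketch ignores.

```python
from typing import Mapping

# Hypothetical glyph-code -> Unicode table for an imaginary proprietary
# Devanagari font encoding. The byte values are made up for illustration.
HYPOTHETICAL_FONT_MAP = {
    0x6B: "\u0915",  # DEVANAGARI LETTER KA
    0x6D: "\u092E",  # DEVANAGARI LETTER MA
    0x6C: "\u0932",  # DEVANAGARI LETTER LA
    0x61: "\u093E",  # DEVANAGARI VOWEL SIGN AA
}


def transcode(raw: bytes, table: Mapping[int, str]) -> str:
    """Map each byte to its Unicode string; unknown bytes become U+FFFD."""
    return "".join(table.get(b, "\ufffd") for b in raw)


def to_utf8(raw: bytes, table: Mapping[int, str]) -> bytes:
    """Transcode font-encoded bytes and serialize the result as UTF-8."""
    return transcode(raw, table).encode("utf-8")


if __name__ == "__main__":
    # Under the assumed table, the bytes "kml" map to KA, MA, LA.
    print(transcode(b"kml", HYPOTHETICAL_FONT_MAP))
```

Once pages in different encodings are mapped onto the same Unicode code points, a single index can serve queries regardless of how each page was originally encoded.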


Published In

WWW '06: Proceedings of the 15th international conference on World Wide Web
May 2006
1102 pages
ISBN:1595933239
DOI:10.1145/1135777
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Indian languages
  2. non-standard encodings
  3. web search

Qualifiers

  • Article

Conference

WWW06
Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
