DOI: 10.1145/1871437.1871651
poster

Alignment of short length parallel corpora with an application to web search

Published: 26 October 2010

ABSTRACT

With the evolving Web, short-length parallel corpora such as user queries and web snippets are becoming very common. This paper concerns situations where short-length parallel corpora have to be analyzed in order to find a meaningful unit alignment. This is similar to dealing with parallel corpora where a sentence-level alignment of translations is required, but differs in that the alignment is to be inferred at the unit (word or phrase) level. A Conditional Random Field (CRF) based approach is proposed to discover this unit alignment. Given pairs of semantically or syntactically similar entities, the problem is formulated as one of mutual segmentation and sequence alignment. Mutual segmentation refers to the process of segmenting the first entity based on the units (or labels) in the second entity and vice versa. Optimizing this mutual segmentation also yields the optimal unit alignment. Since our training data is not segmented and unit-aligned, we modify the CRF objective function to accommodate unsupervised data and iterative learning. We have applied this framework to the Web search domain, specifically to the query reformulation task. Our experiments suggest that the proposed approach indeed produces meaningful alternatives to the original query.
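To make the idea of mutual segmentation concrete, here is a minimal toy sketch in Python. It is not the CRF model described in the abstract: it enumerates every segmentation of two short queries by brute force, aligns units in order, and picks the pair of segmentations with the highest total score. The score table, the penalty of -1.0 for unmatched units, the monotone (in-order) alignment, and the function names `segmentations`, `unit_score`, and `best_alignment` are all assumptions made for illustration; in the paper's setting such scores would come from a learned model, not a hand-written table.

```python
from itertools import combinations


def segmentations(tokens):
    """Yield every split of `tokens` into contiguous units (words or phrases)."""
    n = len(tokens)
    for k in range(n):  # k = number of cut points between tokens
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            yield [" ".join(tokens[a:b]) for a, b in zip(bounds, bounds[1:])]


def unit_score(u, v, table):
    """Hypothetical unit-pair score: identical units score 1.0,
    pairs listed in `table` use their entry, anything else is penalized."""
    if u == v:
        return 1.0
    return table.get((u, v), -1.0)


def best_alignment(query_a, query_b, table):
    """Jointly choose segmentations of both queries so that the in-order
    (monotone) unit alignment has the highest total score."""
    best_score, best_pairs = float("-inf"), []
    for seg_a in segmentations(query_a.split()):
        for seg_b in segmentations(query_b.split()):
            if len(seg_a) != len(seg_b):
                continue  # this toy aligns units one-to-one only
            pairs = list(zip(seg_a, seg_b))
            score = sum(unit_score(u, v, table) for u, v in pairs)
            if score > best_score:
                best_score, best_pairs = score, pairs
    return best_score, best_pairs


# Hand-written scores standing in for learned parameters (assumed values).
table = {
    ("cheap", "low cost"): 0.9,
    ("flights", "airfare"): 0.8,
    ("new york", "nyc"): 0.95,
}
score, pairs = best_alignment("cheap flights new york", "low cost airfare nyc", table)
print(pairs)
```

Note how the segmentation of each query depends on the other: "new york" is kept as one unit only because it aligns well with "nyc", and "low cost" stays together because of "cheap". Brute-force enumeration is exponential in query length, so it is workable only for very short inputs like the ones shown here.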


      Published in

        CIKM '10: Proceedings of the 19th ACM international conference on Information and knowledge management
        October 2010
        2036 pages
        ISBN: 9781450300995
        DOI: 10.1145/1871437

        Copyright © 2010 ACM


        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Acceptance Rates

        Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%
