poster

Alignment of short length parallel corpora with an application to web search

Authors:
Jitendra Ajmera, IBM India Research Lab, New Delhi, India
Hema Swetha Koppula, Cornell University, Ithaca, NY, USA
Krishna P. Leela, Yahoo! Labs, Bangalore, India
Shibnath Mukherjee, Yahoo! Labs, Bangalore, India
Mehul Parsana, Yahoo! Labs, Bangalore, India

CIKM '10: Proceedings of the 19th ACM international conference on Information and knowledge management, October 2010, Pages 1477–1480, https://doi.org/10.1145/1871437.1871651

Published: 26 October 2010

ABSTRACT

As the Web evolves, short-length parallel corpora such as user queries and web snippets are becoming increasingly common. This paper concerns situations in which short-length parallel corpora must be analyzed to find a meaningful unit alignment. The task is similar to aligning translations in parallel corpora at the sentence level, but differs in that the alignment must be inferred at the unit (word or phrase) level. A Conditional Random Field (CRF) based approach is proposed to discover this unit alignment. Given pairs of semantically or syntactically similar entities, the problem is formulated as one of mutual segmentation and sequence alignment. Mutual segmentation refers to segmenting the first entity based on the units (or labels) in the second entity, and vice versa. Optimizing this mutual segmentation also yields the optimal unit alignment. Since our training data is not segmented and unit-aligned, we modify the CRF objective function to accommodate unsupervised data and iterative learning. We have applied this framework to the Web search domain, specifically to the query reformulation task. Our experiments suggest that the proposed approach does produce meaningful alternatives to the original query.
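To make the idea of mutual segmentation and unit alignment concrete, the following is a minimal sketch and not the paper's CRF model. It assumes a monotone (in-order) alignment and replaces learned CRF potentials with a small hand-set score table. The function name `align_units`, the `score` table, and the `max_len` and `unknown` parameters are all hypothetical choices made for illustration only.

```python
from functools import lru_cache


def align_units(src, tgt, score, max_len=2, unknown=-1.0):
    """Jointly segment two token lists into units and align them in order.

    `score` maps (src_unit, tgt_unit) -> float and is a hand-set stand-in
    (an assumption) for what a trained model would supply. An empty string
    marks a unit left unaligned on that side.
    """

    def unit_score(s_unit, t_unit, n_tokens):
        # Known pairs use the table; anything else pays a per-token penalty.
        return score.get((s_unit, t_unit), unknown * n_tokens)

    @lru_cache(maxsize=None)
    def best(i, j):
        # Best (score, alignment) for the suffixes src[i:] and tgt[j:].
        if i == len(src) and j == len(tgt):
            return 0.0, ()
        candidates = []
        for a in range(max_len + 1):          # tokens consumed from src
            if i + a > len(src):
                break
            for b in range(max_len + 1):      # tokens consumed from tgt
                if j + b > len(tgt):
                    break
                if a == 0 and b == 0:
                    continue                  # must consume something
                s_unit = " ".join(src[i:i + a])
                t_unit = " ".join(tgt[j:j + b])
                rest_score, rest = best(i + a, j + b)
                total = unit_score(s_unit, t_unit, a + b) + rest_score
                candidates.append((total, ((s_unit, t_unit),) + rest))
        return max(candidates)

    total, pairs = best(0, 0)
    return total, list(pairs)


# Illustrative query pair and scores (invented for this sketch).
toy_scores = {
    ("cheap", "low cost"): 2.0,
    ("flights", "airfare"): 2.0,
    ("new york", "nyc"): 3.0,
}
total, pairs = align_units(
    "cheap flights new york".split(),
    "low cost airfare nyc".split(),
    toy_scores,
)
print(total, pairs)
```

In this sketch the segmentation of each query is decided by the units available in the other, so the two segmentations and the alignment come out of a single optimization. The paper's approach instead learns the scoring from unsegmented, unaligned pairs with a modified CRF objective, which this toy does not attempt.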

Index Terms

1. Applied computing
  1. Document management and text processing
2. Information systems
  1. Information retrieval
    1. Information retrieval query processing

Published in
CIKM '10: Proceedings of the 19th ACM international conference on Information and knowledge management
October 2010
2036 pages
ISBN: 9781450300995
DOI: 10.1145/1871437
General Chair:
Jimmy Huang, York University, Canada
Program Chairs:
Nick Koudas, University of Toronto, Canada
Gareth Jones, Dublin City University, Ireland
Xindong Wu, University of Vermont, USA
Kevyn Collins-Thompson, Microsoft Research, USA
Aijun An, York University, Canada
Copyright © 2010 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
mutual segmentation
parallel corpora
query reformulation
sequence alignment
web search
Qualifiers
- poster
Acceptance Rates
Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%