DOI: 10.1145/1376616.1376701
research-article

Pay-as-you-go user feedback for dataspace systems

Published: 09 June 2008

Abstract

A primary challenge to large-scale data integration is creating semantic equivalences between elements from different data sources that correspond to the same real-world entity or concept. Dataspaces propose a pay-as-you-go approach: automated mechanisms such as schema matching and reference reconciliation provide initial correspondences, termed candidate matches, and then user feedback is used to incrementally confirm these matches. The key to this approach is to determine in what order to solicit user feedback for confirming candidate matches.
In this paper, we develop a decision-theoretic framework for ordering candidate matches for user confirmation using the concept of the value of perfect information (VPI). At the core of this concept is a utility function that quantifies the desirability of a given state; thus, we devise a utility function for dataspaces based on query result quality. We show in practice how to efficiently apply VPI in concert with this utility function to order user confirmations. A detailed experimental evaluation on both real and synthetic datasets shows that the ordering of user feedback produced by this VPI-based approach yields a dataspace with a significantly higher utility than a wide range of other ordering strategies. Finally, we outline the design of Roomba, a system that utilizes this decision-theoretic framework to guide a dataspace in soliciting user feedback in a pay-as-you-go manner.
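
As a rough illustration of the ordering idea described in the abstract, the following minimal sketch ranks candidate matches by their value of perfect information. It is not the paper's utility function, which is based on query result quality. Here each match carries a made-up probability of being correct (`p_correct`) plus hypothetical `gain` and `loss` values standing in for the utility of using a correct or an incorrect match. All names and numbers are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateMatch:
    """A candidate match awaiting confirmation (all fields are illustrative)."""
    name: str
    p_correct: float  # assumed confidence that the match is correct
    gain: float       # assumed utility gained when a correct match is used
    loss: float       # assumed utility lost when an incorrect match is used


def utility_without_feedback(m: CandidateMatch) -> float:
    """Expected utility of the better action (use or ignore) with no feedback."""
    use = m.p_correct * m.gain - (1.0 - m.p_correct) * m.loss
    ignore = 0.0
    return max(use, ignore)


def utility_with_feedback(m: CandidateMatch) -> float:
    """Expected utility when the user reveals the truth before we act."""
    # Confirmed (probability p_correct): use the match and collect the gain.
    # Rejected (probability 1 - p_correct): ignore the match, utility 0.
    return m.p_correct * m.gain


def vpi(m: CandidateMatch) -> float:
    """Value of perfect information: expected improvement from asking the user."""
    return utility_with_feedback(m) - utility_without_feedback(m)


def order_for_confirmation(matches):
    """Return matches sorted so the most valuable confirmation comes first."""
    return sorted(matches, key=vpi, reverse=True)


# Toy data: three hypothetical candidate matches.
candidates = [
    CandidateMatch("confident", p_correct=0.9, gain=10.0, loss=10.0),
    CandidateMatch("uncertain", p_correct=0.5, gain=10.0, loss=10.0),
    CandidateMatch("unlikely", p_correct=0.2, gain=3.0, loss=1.0),
]

for match in order_for_confirmation(candidates):
    print(f"{match.name}: VPI = {vpi(match):.2f}")
```

Under this toy model the VPI reduces to the smaller of the expected gain forgone and the expected loss avoided, so matches that are both uncertain and high-stakes are confirmed first, while matches the system is already nearly sure about are deferred.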

    Published In

    SIGMOD '08: Proceedings of the 2008 ACM SIGMOD international conference on Management of data
    June 2008
    1396 pages
    ISBN:9781605581026
    DOI:10.1145/1376616

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. data integration
    2. dataspace
    3. decision theory
    4. user feedback

    Qualifiers

    • Research-article

    Conference

    SIGMOD/PODS '08

    Acceptance Rates

    Overall Acceptance Rate 785 of 4,003 submissions, 20%

    Article Metrics

    • Downloads (last 12 months): 47
    • Downloads (last 6 weeks): 6
    Reflects downloads up to 14 Feb 2025.

    Cited By

    • (2024) End-to-end pseudo relevance feedback based vertical web search queries recommendation. Multimedia Tools and Applications. DOI: 10.1007/s11042-024-18559-4. Online publication date: 21-Feb-2024
    • (2024) Situational Data Integration in Question Answering systems: a survey over two decades. Knowledge and Information Systems 66:10 (5875-5918). DOI: 10.1007/s10115-024-02136-0. Online publication date: 18-Jun-2024
    • (2023) CQFaRAD: Collaborative Query-Answering Framework for a Research Article Dataspace. International Journal of Information Technology. DOI: 10.1007/s41870-023-01518-x. Online publication date: 30-Sep-2023
    • (2023) Crowdsourcing of labeling image objects: an online gamification application for data collection. Multimedia Tools and Applications 83:7 (20827-20860). DOI: 10.1007/s11042-023-16325-6. Online publication date: 4-Aug-2023
    • (2023) Data Driven Smart Cities and Data Spaces. New Metropolitan Perspectives (378-388). DOI: 10.1007/978-3-031-34211-0_18. Online publication date: 30-May-2023
    • (2022) A Sustainable Solution for IoT Semantic Interoperability: Dataspaces Model via Distributed Approaches. IEEE Internet of Things Journal 9:10 (7228-7242). DOI: 10.1109/JIOT.2021.3097068. Online publication date: 15-May-2022
    • (2022) Progressive Entity Matching via Cost Benefit Analysis. IEEE Access 10 (3979-3989). DOI: 10.1109/ACCESS.2021.3139987. Online publication date: 2022
    • (2022) TIKD: A Trusted Integrated Knowledge Dataspace for Sensitive Data Sharing and Collaboration. Data Spaces (265-291). DOI: 10.1007/978-3-030-98636-0_13. Online publication date: 11-Mar-2022
    • (2021) TIKD: A Trusted Integrated Knowledge Dataspace For Sensitive Healthcare Data Sharing. 2021 IEEE 45th Annual Computers, Software, and Applications Conference (COMPSAC) (1855-1860). DOI: 10.1109/COMPSAC51774.2021.00280. Online publication date: Jul-2021
    • (2021) Industrial Dataspace for smart manufacturing: connotation, key technologies, and framework. International Journal of Production Research 61:12 (3868-3883). DOI: 10.1080/00207543.2021.1955996. Online publication date: 16-Aug-2021
