DOI: 10.1145/1376616.1376702
research-article

Bootstrapping pay-as-you-go data integration systems

Published: 09 June 2008

Abstract

Data integration systems offer a uniform interface to a set of data sources. Despite recent progress, setting up and maintaining a data integration application still requires significant upfront effort to create a mediated schema and semantic mappings from the data sources to that schema. Many application contexts involving multiple data sources (e.g., the web, personal information management, enterprise intranets) do not require full integration to provide useful services, which motivates a pay-as-you-go approach to integration. Under that approach, a system starts with very few (or inaccurate) semantic mappings and improves them over time as needed.
This paper describes the first completely self-configuring data integration system. The goal of our work is to investigate how advanced a starting point we can provide for a pay-as-you-go system. Our system is based on the new concept of a probabilistic mediated schema that is automatically created from the data sources. We also automatically create probabilistic schema mappings between the sources and the mediated schema. We describe experiments in multiple domains, involving 50 to 800 data sources, and show that our system is able to produce high-quality answers with no human intervention.
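
To make the idea of a probabilistic schema mapping concrete, the following is a minimal sketch, not the system described in the paper. It assumes that each candidate mapping between a source and the mediated schema carries a probability, and that an answer tuple is scored by summing the probabilities of the candidate mappings under which it is returned. The scoring rule, the attribute names (`hPhone`, `oPhone`), the rows, and the probabilities are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical source table; rows use the source's own attribute names.
source_rows = [
    {"name": "Alice", "hPhone": "555-0100", "oPhone": "555-0199"},
    {"name": "Bob", "hPhone": "555-0111", "oPhone": "555-0122"},
]

# Hypothetical probabilistic mapping: each candidate maps mediated-schema
# attributes to source attributes and carries a probability (summing to 1).
p_mapping = [
    ({"name": "name", "phone": "hPhone"}, 0.7),
    ({"name": "name", "phone": "oPhone"}, 0.3),
]


def answer(select_attrs, rows, mappings):
    """Score the answers to a projection query on the mediated schema.

    Assumption: a tuple's score is the total probability of the candidate
    mappings under which the tuple appears in the answer.
    """
    scores = defaultdict(float)
    for mapping, prob in mappings:
        # Use a set so a tuple is counted once per candidate mapping.
        answers = {tuple(row[mapping[a]] for a in select_attrs) for row in rows}
        for t in answers:
            scores[t] += prob
    return dict(scores)


result = answer(["name", "phone"], source_rows, p_mapping)
names_only = answer(["name"], source_rows, p_mapping)
```

In this sketch, a query for `name` and `phone` returns each person's home number with the score of the first candidate mapping and the office number with the score of the second, while a query for `name` alone returns each name with the combined score of both candidates, since both mappings agree on that attribute.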

    Published In

    SIGMOD '08: Proceedings of the 2008 ACM SIGMOD international conference on Management of data
    June 2008
    1396 pages
    ISBN: 9781605581026
    DOI: 10.1145/1376616

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. data integration
    2. mediated schema
    3. pay-as-you-go
    4. schema mapping

    Qualifiers

    • Research-article

    Conference

    SIGMOD/PODS '08

    Acceptance Rates

    Overall Acceptance Rate 785 of 4,003 submissions, 20%
