Abstract
The integration of distributed, heterogeneous databases, such as those available on the World Wide Web, poses many problems. Herer we consider the problem of integrating data from sources that lack common object identifiers. A solution to this problem is proposed for databases that contain informal, natural-language “names” for objects; most Web-based databases satisfy this requirement, since they usually present their information to the end-user through a veneer of text. We describe WHIRL, a “soft” database management system which supports “similarity joins,” based on certain robust, general-purpose similarity metrics for text. This enables fragments of text (e.g., informal names of objects) to be used as keys. WHIRL includes textual objects as a built-in type, similarity reasoning as a built-in predicate, and answers every query with a list of answer substitutions that are ranked according to an overall score. Experiments show that WHIRL is much faster than naive inference methods, even for short queries, and efficient on typical queries to real-world databases with tens of thousands of tuples. Inferences made by WHIRL are also surprisingly accurate, equaling the accuracy of hand-coded normalization routines on one benchmark problem, and outerperforming exact matching with a plausible global domain on a second.
- ABITEBOUL,S.AND VIANU, V. 1997. Regular path queries with constraints. In Proceedings of the Sixteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Sys-tems (PODS-97) (Tucson, AZ, May 1997). Google Scholar
- ARENS, Y., KNOBLOCK,C.A.,AND HSU, C.-N. 1996. Query processing in the SIMS informa-tion mediator. In A. Tate Ed., Advanced Planning Technology. Menlo Park, CA: AAAI Press.Google Scholar
- ATZENI, P., MECCA, G., AND MERIALDO, P. 1997. Semistructured and structured data on the Web: going back and forth. In D. Suciu Ed., Proceedings of the Workshop on Management of Semistructured Data (Tucson, Arizona, May 1997). Available on-line from http://www.re-search. att.com/ ; suciu/workshop-papers.html.Google Scholar
- BARBARA, D., GARCIA-MOLINA, H., AND PORTER, D. 1992. The management of probabilistic data. IEEE Transations on knowledge and data engineering 4, 5 (October), 487-501. Google Scholar
- BARTELL,B.T.,COTTRELL,G.W.,AND BELEW, R. K. 1994. Automatic combination of multiple ranked retrieval systems. In Seventeenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (1994). Google Scholar
- BAYARDO,R.J.,BOHRER, W., BRICE, R., CICHOCKI, A., FOWLER, J., HELAL, A., KASHYAP, V., KSIEZYK, T., MARTIN, G., NODINE, M., RASHID, M., RUSINKIEWICZ, M., SHEA, R., UNNIKRISHAN, C., UNRUH, A., AND WOELK, D. 1997. Infosleuth: an agent-based semantic integration of information in open and dynamic environments. In Proceedings of the 1997 ACM SIGMOD (May 1997). Google Scholar
- BORGMAN,C.L.AND SIEGFRIED, S. L. 1992. Getty's Synoname and its cousins: a survey of applications of personal name-matching algorithms. Journal of the American Society for Information Science 43, 7, 459-476.Google Scholar
- BOSC,P.AND PRADE, H. 1997. An introduction to the fuzzy set and possibility theory-based treatment of queries and uncertain or imprecise databases. In Uncertainty management in information systems. Kluwer Academic Publishers.Google Scholar
- BOYAN, J., FREITAG, D., AND JOACHIMS, T. 1994. A machine learning architecture for optimiz-ing web search engines. Technical Report WS-96-05, American Association of Artificial Intelligence.Google Scholar
- CHAUDHURI, S., DAYAL, U., AND YAN, T. 1995. Join queries with external text sources: execution and optimization techniques. In Proceedings of the 1995 ACM SIGMOD (May 1995). Google Scholar
- COHEN, W. W. 1997. Knowledge integration for structured information sources containing text (extended abstract). In The SIGIR-97 Workshop on Networked Information Retrieval (1997).Google Scholar
- COHEN, W. W. 1998a. Integration of heterogeneous databases without common domains using queries based on textual similarity. In Proceedings of ACM SIGMOD-98 (Seattle, WA, 1998). Google Scholar
- COHEN, W. W. 1998b. A Web-based information system that reasons with structured collections of text. In Proceedings of Autonomous Agents-98 (St. Paul, MN, 1998). Google Scholar
- COHEN,W.W.AND HIRSH, H. 1998. Joins that generalize: Text categorization using WHIRL. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (New York, NY, 1998), pp. 169-173.Google Scholar
- COHEN,W.W.AND SINGER, Y. 1996. Context-sensitive learning methods for text categoriza-tion. In Proceedings of the 19th Annual International ACM Conference on Research and Development in Information Retrieval (Zurich, Switzerland, 1996), pp. 307-315. ACM Press. Google Scholar
- COHEN,W.W.,SCHAPIRE,R.E.,AND SINGER, Y. 1997. Learning to order things. In Advances in Neural Processing Systems 10 (Denver, CO, 1997). MIT Press. Google Scholar
- CRAVEN, M., DIPASQUO, D., FREITAG, D., MCCALLUM, A., MITCHELL, T., NIGAM, K., AND SLATTERY, S. 1998. Learning to extract symbolic knowledge from the world wide web. In Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98) (Madison, WI, 1998). Google Scholar
- DUSCHKA,O.M.AND GENESERETH, M. R. 1997a. Answering recursive queries using views. In Proceedings of the Sixteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS-97) (Tucson, AZ, May 1997). Google Scholar
- DUSCHKA,O.M.AND GENESERETH, M. R. 1997b. Query planning in infomaster. In Proceed-ings of the Twelfth Annual ACM Symposium on Applied Computing (SAC97) (San Jose, CA, February 1997). Google Scholar
- FAGAN, J. L. 1989. The effectiveness of a nonsyntactic approach to automatic phrase indexing for document retrieval. Journal of the American Society for Information Science 40, 2, 115-132. Google Scholar
- FAGIN, R. 1998. Fuzzy queries in multimedia database systems. In Proc. 1998 ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS'98) (1998). Google Scholar
- FANG, D., HAMMER, J., AND MCLEOD, D. 1994. The identification and resolution of semantic heterogeneity in multidatabase systems. In Multidatabase Systems: An Advanced Solution for Global Information Sharing, pp. 52-60. IEEE Computer Society Press, Los Alamitos, California.Google Scholar
- FELLIGI,I.P.AND SUNTER, A. B. 1969. A theory for record linkage. Journal of the American Statistical Society 64, 1183-1210.Google Scholar
- FIEBIG, T., WEISS, J., AND MOERKOTTE, G. 1997. RAW: a relational algebra for the Web. In D. Suciu Ed., Proceedings of the Workshop on Management of Semistructured Data (Tucson, Arizona, May 1997). Available on-line from http://www.research.att.com/ ; suciu/workshop-papers. html.Google Scholar
- FUHR, N. 1995. Probabilistic Datalog:a logic for powerful retrieval methods. In Proceed-ings of the 1995 ACM SIGIR conference on research in information retrieval (New York, 1995), pp. 282-290. Google Scholar
- GARCIA-MOLINA, H., PAPAKONSTANTINOU, Y., QUASS, D., RAJARAMAN, A., SAGIV, Y., ULLMAN, J., AND WIDOM, J. 1995. The TSIMMIS approach to mediation: Data models and languages (extended abstract). In Next Generation Information Technologies and Systems (NGITS-95) (Naharia, Israel, November 1995).Google Scholar
- HERNANDEZ,M.AND STOLFO, S. 1995. The merge/purge problem for large databases. In Proceedings of the 1995 ACM SIGMOD (May 1995). Google Scholar
- HUFFMAN,S.AND STEIER, D. 1995. Heuristic joins to integrate structured heterogeneous data. In Working notes of the AAAI spring symposium on information gathering in heteroge-neous distributed environments (Palo Alto, CA, March 1995). AAAI Press.Google Scholar
- KILSS,B.AND ALVEY, W. 1985. Record linkage techniques:1985. Statistics of Income Division, Internal Revenue Service Publication 1299-2-96. Available from http://www.bts. gov/fcsm/methodology/.Google Scholar
- KNUTH, D. E. 1975. The Art of Computer Programming, Volume I: Fundamental Algorithms (second edition). Addison-Wesley, Reading, MA. Google Scholar
- KONOPNICKI,D.AND SCHMUELI, O. 1995. W3QS: a query system for the world wide web. In Proceedings of the 21st International Conference on Very Large Databases (VLDB-96) (Zurich, Switzerland, 1995). Google Scholar
- KORF, R. 1993. Linear-space best-first search. Artificial Intelligence 62, 1 (July), 41-78. Google Scholar
- LEVY,A.Y.,RAJARAMAN, A., AND ORDILLE, J. J. 1996a. Query answering algorithms for information agents. In Proceedings of the 13th National Conference on Artificial Intelligence (AAAI-96) (Portland, Oregon, August 1996). Google Scholar
- LEVY,A.Y.,RAJARAMAN, A., AND ORDILLE, J. J. 1996b. Querying heterogeneous information sources using source descriptions. In Proceedings of the 22nd International Conference on Very Large Databases (VLDB-96) (Bombay, India, September 1996). Google Scholar
- LEWIS, D. 1992. Representation and learning in information retrieval. Technical Report 91-93, Computer Science Dept., University of Massachusetts at Amherst. PhD Thesis. Google Scholar
- MENDELZON,A.AND MILO, T. 1997. Formal models of Web queries. In Proceedings of the Sixteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS-97) (Tucson, AZ, May 1997). Google Scholar
- MONGE,A.AND ELKAN, C. 1996. The field-matching problem: algorithm and applications. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (August 1996).Google Scholar
- MONGE,A.AND ELKAN, C. 1997. An efficient domain-independent algorithm for detecting approximately duplicate database records. In The proceedings of the SIGMOD 1997 work-shop on data mining and knowledge discovery (May 1997).Google Scholar
- NEWCOMBE,H.B.,KENNEDY,J.M.,AXFORD,S.J.,AND JAMES, A. P. 1959. Automatic linkage of vital records. Science 130, 954-959.Google Scholar
- NILSSON, N. 1987. Principles of Artificial Intelligence. Morgan Kaufmann. Google Scholar
- PEARL, J. 1984. Heuristics: Intelligent Search Strategies for Computer Problem Solving. Addison-Wesley, Reading, MA. Google Scholar
- PORTER, M. F. 1980. An algorithm for suffix stripping. Program 14, 3, 130-137.Google Scholar
- QUINLAN, J. R. 1990. Learning logical definitions from relations. Machine Learning 5,3, 239-266. Google Scholar
- SALTON,G.ED. 1989. Automatic Text Processing. Addison Wesley, Reading, Massachusetts. Google Scholar
- SCHA ~ UBLE, P. 1993. SPIDER: A multiuser information retrieval system for semistructured and dynamic data. In Proceedings of the 1993 ACM SIGIR conference on research in information retrieval (Pittsburgh, PA, 1993), pp. 318-327. Google Scholar
- SUCIU, D. 1996. Query decomposition and view maintenance for query languages for un-structured data. In Proceedings of the 22nd International Conference on Very Large Data-bases (VLDB-96) (Bombay, India, 1996). Google Scholar
- SUCIU,D.ED. 1997. Proceedings of the Workshop on Management of Semistructured Data. Available on-line from http://www.research.att.com/suciu/workshop-papers.html, Tucson, Arizona. Google Scholar
- TOMASIC, A., AMOUROUX, R., BONNET, P., AND KAPITSKAIA, O. 1997. The distributed informa-tion search component (Disco) and the World Wide Web. In Proceedings of the 1997 ACM SIGMOD (May 1997). Google Scholar
- TURTLE,H.AND FLOOD, J. 1995. Query evaluation: strategies and optimizations. Informa-tion processing and management 31, 6 (November), 831-850. Google Scholar
- ZADEH, L. A. 1965. Fuzzy sets. Information and Control 8, 338-353.Google Scholar
Index Terms
- Data integration using similarity joins and a word-based information representation language
Recommendations
Augmenting Data Retrieval with Information Retrieval Techniques by Using Word Similarity
NLDB '08: Proceedings of the 13th international conference on Natural Language and Information Systems: Applications of Natural Language to Information SystemsData retrieval (DR) and information retrieval (IR) have traditionally occupied two distinct niches in the world of information systems. DR systems effectively store and query structured data, but lack the flexibility of IR, i.e., the ability to retrieve ...
Similarity Joins
Similarity Joins are extensively used in multiple application domains and are recognized among the most useful data processing and analysis operations. They retrieve all data pairs whose distances are smaller than a predefined threshold ε. While several ...
Top-k String Similarity Joins
SSDBM '20: Proceedings of the 32nd International Conference on Scientific and Statistical Database ManagementTop-k joins have been extensively studied in relational databases as ranking operations when every object has, among others, at least one ranking attribute. However, the focus has mostly been the case when the join attributes are of primitive data types ...
Comments