|
ABSTRACT
An ad hoc data source is any semistructured data source for which useful data analysis and transformation tools are not readily available. Such data must be queried, transformed and displayed by systems administrators, computational biologists, financial analysts and hosts of others on a regular basis. In this paper, we demonstrate that it is possible to generate a suite of useful data processing tools, including a semi-structured query engine, several format converters, a statistical analyzer and data visualization routines directly from the ad hoc data itself, without any human intervention. The key technical contribution of the work is a multi-phase algorithm that automatically infers the structure of an ad hoc data source and produces a format specification in the PADS data description language. Programmers wishing to implement custom data analysis tools can use such descriptions to generate printing and parsing libraries for the data. Alternatively, our software infrastructure will push these descriptions through the PADS compiler, creating format-dependent modules that, when linked with format-independent algorithms for analysis and transformation, result infully functional tools. We evaluate the performance of our inference algorithm, showing it scales linearlyin the size of the training data - completing in seconds, as opposed to the hours or days it takes to write a description by hand. We also evaluate the correctness of the algorithm, demonstrating that generating accurate descriptions often requires less than 5% of theavailable data.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
|
 |
2
|
|
| |
3
|
|
| |
4
|
|
 |
5
|
Vinayak Borkar , Kaustubh Deshmukh , Sunita Sarawagi, Automatic segmentation of text into structured records, Proceedings of the 2001 ACM SIGMOD international conference on Management of data, p.175-186, May 21-24, 2001, Santa Barbara, California, United States
|
| |
6
|
David Burke, Kathleen Fisher, David Walker, Peter White, and Kenny Q. Zhu. Towards 1-click tool generation with PADS. In CAGI, Corvallis, OR, June 2007.
|
| |
7
|
Sudarshan Chawathe, Hector Garcia-Molina, Joachim Hammer, Kelly Ireland, Yannis Papakonstantinou, Jeffrey D. Ullman, and Jennifer Widom. The TSIMMIS project: Integration of heterogeneous information sources. In 16th Meeting of the Information Processing Society of Japan, pages 7--18, Tokyo, Japan, 1994.
|
| |
8
|
|
| |
9
|
|
| |
10
|
Mary F. Fernández, Kathleen Fisher, Robert Gruber, and Yitzhak Mandelbaum. PADX: Querying large-scale ad hoc data with XQuery. In PLAN-X, January 2006.
|
| |
11
|
|
 |
12
|
|
 |
13
|
Kathleen Fisher , Yitzhak Mandelbaum , David Walker, The next 700 data description languages, Conference record of the 33rd ACM SIGPLAN-SIGACT symposium on Principles of programming languages, p.2-15, January 11-13, 2006, Charleston, South Carolina, USA
|
 |
14
|
|
| |
15
|
E. M. Gold. Language identification in the limit. Information and Control, 10(5):447--474, 1967.
|
| |
16
|
Peter D. Grünwald. The Minimum Description Length Principle. MIT Press, May 2007.
|
| |
17
|
Theodore W. Hong. Grammatical Inference for Information Extraction and Visualisation on the Web. Ph.D. Thesis, Imperial College London, 2002.
|
| |
18
|
Ykä Huhtala, Juha Kärkkäinen, Pasi Porkka, and Hannu Toivonen. TANE: An efficient algorithm for discovering functional and approximate dependencies. The Computer Journal, 42(2):100--111, 1999.
|
| |
19
|
Jason L. Hutchens and Michael D. Alder. Finding structure via compression. In David M. W. Powers, editor, Proceedings of the Joint Conference on New Methods in Language Processing and Computational Natural Language Learning, pages 79--82. 1998.
|
| |
20
|
|
| |
21
|
Nicholas Kushmerick, Daniel S. Weld, and Robert B. Doorenbos. Wrapper induction for information extraction. In IJCAI, pages 729--737, 1997. Kristina Lerman, Lise Getoor, Steven Minton, and Craig Knoblock. Using the structure of web sites for automatic segmentation of tables. In SIGMOD, pages 119--130, New York, NY, USA, 2004.
|
 |
22
|
|
| |
23
|
J. Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145--151, 1991.
|
 |
24
|
Yitzhak Mandelbaum , Kathleen Fisher , David Walker , Mary Fernandez , Artem Gleyzer, PADS/ML: a functional data description language, Proceedings of the 34th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages, January 17-19, 2007, Nice, France
|
 |
25
|
|
| |
26
|
Ion Muslea, Steve Minton, and Craig Knoblock. Active learning with strong and weak views: a case study on wrapper induction. In IJCAI, pages 415--420, 2003.
|
| |
27
|
|
| |
28
|
J. Oncina and P. Garcia. Inferring regular languages in polynomial updated time. Machine Perception and Artificial Intelligence, 1:29--61, 1992.
|
| |
29
|
PADS Project. PADS project. http://www.padsproj.org/, 2007.
|
 |
30
|
|
| |
31
|
Stefan Raeymaekers, Maurice Bruynooghe, and Jan Van den Bussche. Learning (k, l)-contextual tree languages for information extraction. In ECML, pages 305--316, 2005.
|
| |
32
|
|
| |
33
|
Kurt A. Shoens , Allen Luniewski , Peter M. Schwarz , James W. Stamos , Joachim Thomas, II, The Rufus System: Information Organization for Semi-Structured Data, Proceedings of the 19th International Conference on Very Large Data Bases, p.97-107, August 24-27, 1993
|
| |
34
|
|
| |
35
|
|
| |
36
|
|
| |
37
|
R. M. Wharton. Approximate language identification. Information and Control, 26(3):236--255, 1974.
|
|