ACM Home Page
Please provide us with feedback. Feedback
From dirt to shovels: fully automatic tool generation from ad hoc data
Full text PdfPdf (347 KB)
Source
Annual Symposium on Principles of Programming Languages archive
Proceedings of the 35th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages table of contents
San Francisco, California, USA
SESSION: Session 12 table of contents
Pages 421-434  
Year of Publication: 2008
ISBN:978-1-59593-689-9
Also published in ...
Authors
Kathleen Fisher  AT&T Labs Research, San Jose, CA
David Walker  Princeton University, Princeton, NJ
Kenny Q. Zhu  Princeton University, Princeton, NJ
Peter White  Galois: Inc., Beaverton, OR
Sponsors
ACM: Association for Computing Machinery
SIGACT: ACM Special Interest Group on Algorithms and Computation Theory
SIGPLAN: ACM Special Interest Group on Programming Languages
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 19,   Downloads (12 Months): 199,   Citation Count: 0
Additional Information:

abstract   references   index terms   collaborative colleagues  

Tools and Actions: Review this Article  
Save this Article to a Binder    Display Formats: BibTex  EndNote ACM Ref   
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1328438.1328488
What is a DOI?

ABSTRACT

An ad hoc data source is any semistructured data source for which useful data analysis and transformation tools are not readily available. Such data must be queried, transformed and displayed by systems administrators, computational biologists, financial analysts and hosts of others on a regular basis. In this paper, we demonstrate that it is possible to generate a suite of useful data processing tools, including a semi-structured query engine, several format converters, a statistical analyzer and data visualization routines directly from the ad hoc data itself, without any human intervention. The key technical contribution of the work is a multi-phase algorithm that automatically infers the structure of an ad hoc data source and produces a format specification in the PADS data description language. Programmers wishing to implement custom data analysis tools can use such descriptions to generate printing and parsing libraries for the data. Alternatively, our software infrastructure will push these descriptions through the PADS compiler, creating format-dependent modules that, when linked with format-independent algorithms for analysis and transformation, result infully functional tools. We evaluate the performance of our inference algorithm, showing it scales linearlyin the size of the training data - completing in seconds, as opposed to the hours or days it takes to write a description by hand. We also evaluate the correctness of the algorithm, demonstrating that generating accurate descriptions often requires less than 5% of theavailable data.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

1
2
 
3
 
4
5
 
6
David Burke, Kathleen Fisher, David Walker, Peter White, and Kenny Q. Zhu. Towards 1-click tool generation with PADS. In CAGI, Corvallis, OR, June 2007.
 
7
Sudarshan Chawathe, Hector Garcia-Molina, Joachim Hammer, Kelly Ireland, Yannis Papakonstantinou, Jeffrey D. Ullman, and Jennifer Widom. The TSIMMIS project: Integration of heterogeneous information sources. In 16th Meeting of the Information Processing Society of Japan, pages 7--18, Tokyo, Japan, 1994.
 
8
 
9
 
10
Mary F. Fernández, Kathleen Fisher, Robert Gruber, and Yitzhak Mandelbaum. PADX: Querying large-scale ad hoc data with XQuery. In PLAN-X, January 2006.
 
11
12
13
14
 
15
E. M. Gold. Language identification in the limit. Information and Control, 10(5):447--474, 1967.
 
16
Peter D. Grünwald. The Minimum Description Length Principle. MIT Press, May 2007.
 
17
Theodore W. Hong. Grammatical Inference for Information Extraction and Visualisation on the Web. Ph.D. Thesis, Imperial College London, 2002.
 
18
Ykä Huhtala, Juha Kärkkäinen, Pasi Porkka, and Hannu Toivonen. TANE: An efficient algorithm for discovering functional and approximate dependencies. The Computer Journal, 42(2):100--111, 1999.
 
19
Jason L. Hutchens and Michael D. Alder. Finding structure via compression. In David M. W. Powers, editor, Proceedings of the Joint Conference on New Methods in Language Processing and Computational Natural Language Learning, pages 79--82. 1998.
 
20
 
21
Nicholas Kushmerick, Daniel S. Weld, and Robert B. Doorenbos. Wrapper induction for information extraction. In IJCAI, pages 729--737, 1997. Kristina Lerman, Lise Getoor, Steven Minton, and Craig Knoblock. Using the structure of web sites for automatic segmentation of tables. In SIGMOD, pages 119--130, New York, NY, USA, 2004.
22
 
23
J. Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145--151, 1991.
24
25
 
26
Ion Muslea, Steve Minton, and Craig Knoblock. Active learning with strong and weak views: a case study on wrapper induction. In IJCAI, pages 415--420, 2003.
 
27
 
28
J. Oncina and P. Garcia. Inferring regular languages in polynomial updated time. Machine Perception and Artificial Intelligence, 1:29--61, 1992.
 
29
PADS Project. PADS project. http://www.padsproj.org/, 2007.
30
 
31
Stefan Raeymaekers, Maurice Bruynooghe, and Jan Van den Bussche. Learning (k, l)-contextual tree languages for information extraction. In ECML, pages 305--316, 2005.
 
32
 
33
 
34
 
35
 
36
 
37
R. M. Wharton. Approximate language identification. Information and Control, 26(3):236--255, 1974.

Collaborative Colleagues:
Kathleen Fisher: colleagues
David Walker: colleagues
Kenny Q. Zhu: colleagues
Peter White: colleagues