Abstract
We address the problem of learning a syntactic profile for a collection of strings, i.e. a set of regex-like patterns that succinctly describe the syntactic variations in the strings. Real-world datasets, typically curated from multiple sources, often contain data in various syntactic formats. Thus, any data processing task is preceded by the critical step of data format identification. However, manual inspection of data to identify the different formats is infeasible in standard big-data scenarios.
Prior techniques are restricted to a small set of pre-defined patterns (e.g. digits, letters, words etc.), and provide no control over granularity of profiles. We define syntactic profiling as a problem of clustering strings based on syntactic similarity, followed by identifying patterns that succinctly describe each cluster. We present a technique for synthesizing such profiles over a given language of patterns, that also allows for interactive refinement by requesting a desired number of clusters.
Using a state-of-the-art inductive synthesis framework, PROSE, we have implemented our technique as FlashProfile. Across 153 tasks over 75 large real datasets, we observe a median profiling time of only ∼ 0.7s. Furthermore, we show that access to syntactic profiles may allow for more accurate synthesis of programs, i.e. using fewer examples, in programming-by-example (PBE) workflows such as Flash Fill.
Supplemental Material
- Ziawasch Abedjan, Lukasz Golab, and Felix Naumann. 2015. Profiling Relational Data: A Survey. VLDB J. 24, 4 (2015), 557–581. Google ScholarDigital Library
- Dana Angluin. 1987. Learning Regular Sets from Queries and Counterexamples. Inf. Comput. 75, 2 (1987), 87–106. Google ScholarDigital Library
- David Arthur and Sergei Vassilvitskii. 2007. k-means++: The Advantages of Careful Seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2007, New Orleans, Louisiana, USA, January 7-9, 2007, Nikhil Bansal, Kirk Pruhs, and Clifford Stein (Eds.). SIAM, 1027–1035. http://dl.acm.org/citation.cfm?id=1283383.1283494 Google ScholarDigital Library
- Ataccama. 2017. Ataccama One Platform. https://www.ataccama.com/ .Google Scholar
- Alberto Bartoli, Giorgio Davanzo, Andrea De Lorenzo, Marco Mauri, Eric Medvet, and Enrico Sorio. 2012. Automatic Generation of Regular Expressions from Examples with Genetic Programming. In Genetic and Evolutionary Computation Conference, GECCO ’12, Philadelphia, PA, USA, July 7-11, 2012, Companion Material Proceedings, Terence Soule and Jason H. Moore (Eds.). ACM, 1477–1478. Google ScholarDigital Library
- Osbert Bastani, Rahul Sharma, Alex Aiken, and Percy Liang. 2017. Synthesizing program input grammars. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2017, Barcelona, Spain, June 18-23, 2017, Albert Cohen and Martin T. Vechev (Eds.). ACM, 95–110. Google ScholarDigital Library
- Arka Aloke Bhattacharya, Dezhi Hong, David E. Culler, Jorge Ortiz, Kamin Whitehouse, and Eugene Wu. 2015. Automated Metadata Construction to Support Portable Building Applications. In Proceedings of the 2nd ACM International Conference on Embedded Systems for Energy-Efficient Built Environments, BuildSys 2015, Seoul, South Korea, November 4-5, 2015, David Culler, Yuvraj Agarwal, and Rahul Mangharam (Eds.). ACM, 3–12. Google ScholarDigital Library
- Christopher M. Bishop. 2016. Pattern Recognition and Machine Learning. Springer New York. http://www.worldcat.org/ oclc/1005113608Google Scholar
- Leo Breiman. 2001. Random Forests. Machine Learning 45, 1 (2001), 5–32. Google ScholarDigital Library
- Graham Cormode, Minos N. Garofalakis, Peter J. Haas, and Chris Jermaine. 2012. Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches. Foundations and Trends in Databases 4, 1-3 (2012), 1–294. Google ScholarDigital Library
- Colin De la Higuera. 2010. Grammatical inference: learning automata and grammars. Cambridge University Press. Google ScholarDigital Library
- Xin Luna Dong and Divesh Srivastava. 2013. Big Data Integration. PVLDB 6, 11 (2013), 1188–1189. http://www.vldb.org/ pvldb/vol6/p1188- srivastava.pdf Google ScholarDigital Library
- Kevin Ellis and Sumit Gulwani. 2017. Learning to Learn Programs from Examples: Going Beyond Program Structure. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, August 19-25, 2017, Carles Sierra (Ed.). ijcai.org, 1638–1645. Google ScholarDigital Library
- Kathleen Fisher, David Walker, and Kenny Qili Zhu. 2008. LearnPADS: automatic tool generation from ad hoc data. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2008, Vancouver, BC, Canada, June 10-12, 2008, Jason Tsong-Li Wang (Ed.). ACM, 1299–1302. Google ScholarDigital Library
- Wael H Gomaa and Aly A Fahmy. 2013. A survey of text similarity approaches. International Journal of Computer Applications 68, 13 (April 2013), 13–18.Google ScholarCross Ref
- Google. 2017. OpenRefine: A free, open source, powerful tool for working with messy data. http://openrefine.org/ .Google Scholar
- Ronald L. Graham, Donald E. Knuth, and Oren Patashnik. 1994. Concrete Mathematics - A Foundation for Computer Science, 2nd Edition). Addison-Wesley. http://www.worldcat.org/oclc/992331503 Google ScholarDigital Library
- Sumit Gulwani. 2011. Automating String Processing in Spreadsheets Using Input-Output Examples. In Proceedings of the 38th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2011, Austin, TX, USA, January 26-28, 2011. 317–330. Google ScholarDigital Library
- Sumit Gulwani, Oleksandr Polozov, and Rishabh Singh. 2017. Program Synthesis. Foundations and Trends in Programming Languages 4, 1-2 (2017), 1–119.Google ScholarCross Ref
- Peter J. Haas, Jeffrey F. Naughton, S. Seshadri, and Lynne Stokes. 1995. Sampling-Based Estimation of the Number of Distinct Values of an Attribute. In VLDB’95, Proceedings of 21th International Conference on Very Large Data Bases, September 11-15, 1995, Zurich, Switzerland., Umeshwar Dayal, Peter M. D. Gray, and Shojiro Nishio (Eds.). Morgan Kaufmann, 311–322. http://www.vldb.org/conf/1995/P311.PDF Google ScholarDigital Library
- Maria Halkidi, Yannis Batistakis, and Michalis Vazirgiannis. 2001. On Clustering Validation Techniques. J. Intell. Inf. Syst. 17, 2-3 (2001), 107–145. Google ScholarDigital Library
- Steve Hanneke. 2014. Theory of Disagreement-Based Active Learning. Found. Trends Mach. Learn. 7, 2-3 (June 2014), 131–309. Google ScholarDigital Library
- Yannis E. Ioannidis. 2003. The History of Histograms (abridged). In VLDB 2003, Proceedings of 29th International Conference on Very Large Data Bases, September 9-12, 2003, Berlin, Germany, Johann Christoph Freytag, Peter C. Lockemann, Serge Abiteboul, Michael J. Carey, Patricia G. Selinger, and Andreas Heuer (Eds.). Morgan Kaufmann, 19–30. http: //www.vldb.org/conf/2003/papers/S02P01.pdf Google ScholarDigital Library
- Anil K. Jain, M. Narasimha Murty, and Patrick J. Flynn. 1999. Data Clustering: A Review. ACM Comput. Surv. 31, 3 (1999), 264–323. Google ScholarDigital Library
- Panagiotis Karras and Nikos Mamoulis. 2007. The Haar+ Tree: A Refined Synopsis Data Structure. In Proceedings of the 23rd International Conference on Data Engineering, ICDE 2007, The Marmara Hotel, Istanbul, Turkey, April 15-20, 2007, Rada Chirkova, Asuman Dogac, M. Tamer Özsu, and Timos K. Sellis (Eds.). IEEE Computer Society, 436–445.Google ScholarCross Ref
- Dileep Kini and Sumit Gulwani. 2015. FlashNormalize: Programming by Examples for Text Normalization. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25-31, 2015, Qiang Yang and Michael Wooldridge (Eds.). AAAI Press, 776–783. http://ijcai.org/Abstract/15/115 Google ScholarDigital Library
- Tessa Lau. 2009. Why Programming-By-Demonstration Systems Fail: Lessons Learned for Usable AI. AI Magazine 30, 4 (2009), 65–67.Google ScholarDigital Library
- Vu Le and Sumit Gulwani. 2014. FlashExtract: a framework for data extraction by examples. In ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’14, Edinburgh, United Kingdom - June 09 - 11, 2014, Michael F. P. O’Boyle and Keshav Pingali (Eds.). ACM, 542–553. Google ScholarDigital Library
- Vladimir I Levenshtein. 1966. Binary Codes Capable of Correcting Deletions, Insertions, and Reversals. In Soviet Physics Doklady, Vol. 10. 707–710. http://adsabs.harvard.edu/abs/1966SPhD...10..707LGoogle Scholar
- Yunyao Li, Rajasekar Krishnamurthy, Sriram Raghavan, Shivakumar Vaithyanathan, and H. V. Jagadish. 2008. Regular Expression Learning for Information Extraction. In 2008 Conference on Empirical Methods in Natural Language Processing, EMNLP 2008, Proceedings of the Conference, 25-27 October 2008, Honolulu, Hawaii, USA, A meeting of SIGDAT, a Special Interest Group of the ACL. ACL, 21–30. http://www.aclweb.org/anthology/D08- 1003 Google ScholarDigital Library
- Henry Lieberman. 2001. Your wish is my command: Programming by example. Morgan Kaufmann.Google Scholar
- Steve Lohr. 2014. For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights. New York Times 17 (2014). https: //www.nytimes.com/2014/08/18/technology/for- big- data- scientists- hurdle- to- insights- is- janitor- work.htmlGoogle Scholar
- James MacQueen et al. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, Vol. 1. Oakland, CA, USA., 281–297.Google Scholar
- Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to information retrieval. Cambridge University Press. Google ScholarDigital Library
- Arkady Maydanchik. 2007. Data Quality Assessment. Technics Publications. https://technicspub.com/ data- quality- assessment/ Google ScholarDigital Library
- Mikaël Mayer, Gustavo Soares, Maxim Grechkin, Vu Le, Mark Marron, Oleksandr Polozov, Rishabh Singh, Benjamin G. Zorn, and Sumit Gulwani. 2015. User Interaction Models for Disambiguation in Programming by Example. In Proceedings of the 28th Annual ACM Symposium on User Interface Software & Technology, UIST 2015, Charlotte, NC, USA, November 8-11, 2015, Celine Latulipe, Bjoern Hartmann, and Tovi Grossman (Eds.). ACM, 291–301. Google ScholarDigital Library
- Microsoft. 2017a. Azure Machine Learning By-Example Data Transform. https://www.youtube.com/watch?v=9KG0Sc2B2KI .Google Scholar
- Microsoft. 2017b. Data Transformations "By Example" in the Azure ML Workbench. https://blogs.technet.microsoft.com/ machinelearning/2017/09/25/by- example- transformations- in- the- azure- machine- learning- workbench/ .Google Scholar
- Microsoft. 2017c. Microsoft SQL Server Data Tools (SSDT). https://docs.microsoft.com/en- gb/sql/ssdt .Google Scholar
- Microsoft. 2017d. Program Synthesis using Examples SDK. https://microsoft.github.io/prose/ .Google Scholar
- José Oncina and Pedro García. 1992. Identifying regular languages in polynomial time. Advances in Structural and Syntactic Pattern Recognition 5, 99-108 (1992), 15–20.Google ScholarCross Ref
- Oleksandr Polozov and Sumit Gulwani. 2015. FlashMeta: A Framework for Inductive Program Synthesis. In Proceedings of the 2015 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA 2015, part of SPLASH 2015, Pittsburgh, PA, USA, October 25-30, 2015, Jonathan Aldrich and Patrick Eugster (Eds.). ACM, 107–126. Google ScholarDigital Library
- Vijayshankar Raman and Joseph M. Hellerstein. 2001. Potter’s Wheel: An Interactive Data Cleaning System. In VLDB 2001, Proceedings of 27th International Conference on Very Large Data Bases, September 11-14, 2001, Roma, Italy, Peter M. G. Apers, Paolo Atzeni, Stefano Ceri, Stefano Paraboschi, Kotagiri Ramamohanarao, and Richard T. Snodgrass (Eds.). Morgan Kaufmann, 381–390. http://www.vldb.org/conf/2001/P381.pdf Google ScholarDigital Library
- Veselin Raychev, Pavol Bielik, Martin T. Vechev, and Andreas Krause. 2016. Learning programs from noisy data. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2016, St. Petersburg, FL, USA, January 20 - 22, 2016, Rastislav Bodík and Rupak Majumdar (Eds.). ACM, 761–774. Google ScholarDigital Library
- Rishabh Singh. 2016. BlinkFill: Semi-supervised Programming By Example for Syntactic String Transformations. PVLDB 9, 10 (2016), 816–827. http://www.vldb.org/pvldb/vol9/p816- singh.pdf Google ScholarDigital Library
- Armando Solar-Lezama, Liviu Tancau, Rastislav Bodík, Sanjit A. Seshia, and Vijay A. Saraswat. 2006. Combinatorial sketching for finite programs. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2006, San Jose, CA, USA, October 21-25, 2006, John Paul Shen and Margaret Martonosi (Eds.). ACM, 404–415. Google ScholarDigital Library
- Thorvald Sørensen. 1948. A method of establishing groups of equal amplitude in plant sociology based on similarity of species and its application to analyses of the vegetation on Danish commons. Biol. Skr. 5 (1948), 1–34.Google Scholar
- Borge Svingen. 1998. Learning Regular Languages using Genetic Programming. Proc. Genetic Programming (1998), 374–376.Google Scholar
- Andrei N Tikhonov. 1963. Solution of Incorrectly Formulated Problems and the Regularization Method. In Dokl. Akad. Nauk., Vol. 151. 1035–1038.Google Scholar
- Trifacta. 2017. Trifacta Wrangler. https://www.trifacta.com/products/wrangler/ .Google Scholar
- William E Winkler. 1999. The State of Record Linkage and Current Research Problems.Google Scholar
- Ian H Witten, Eibe Frank, Mark A Hall, and Christopher J Pal. 2017. Data Mining: Practical Machine Learning Tools and Techniques, 4th Edition. Elsevier Science & Technology. http://www.worldcat.org/oclc/1007085077 Google ScholarDigital Library
- Rui Xu and Donald C. Wunsch II. 2005. Survey of Clustering Algorithms. IEEE Trans. Neural Networks 16, 3 (2005), 645–678. Google ScholarDigital Library
- Kenny Qili Zhu, Kathleen Fisher, and David Walker. 2012. LearnPADS++ : Incremental Inference of Ad Hoc Data Formats. In Practical Aspects of Declarative Languages - 14th International Symposium, PADL 2012, Philadelphia, PA, USA, January 23-24, 2012. Proceedings (Lecture Notes in Computer Science), Claudio V. Russo and Neng-Fa Zhou (Eds.), Vol. 7149. Springer, 168–182. Google ScholarDigital Library
Index Terms
- FlashProfile: a framework for synthesizing data profiles
Recommendations
Synthesizing transformations on hierarchically structured data
PLDI '16This paper presents a new approach for synthesizing transformations on tree-structured data, such as Unix directories and XML documents. We consider a general abstraction for such data, called hierarchical data trees (HDTs) and present a novel example-...
Data Profiling: A Tutorial
SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Datais to understand the dataset at hand and its metadata. The process of metadata discovery is known as data profiling. Profiling activities range from ad-hoc approaches, such as eye-balling random subsets of the data or formulating aggregation queries, to ...
Synthesizing data structure manipulations from storyboards
ESEC/FSE '11: Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on Foundations of software engineeringWe present the Storyboard Programming framework, a new synthesis system designed to help programmers write imperative low-level data-structure manipulations. The goal of this system is to bridge the gap between the "boxes-and-arrows" diagrams that ...
Comments