research-article

Open Access

FlashProfile: a framework for synthesizing data profiles

Authors:
Saswat Padhi

University of California at Los Angeles, USA

University of California at Los Angeles, USA
View Profile

,
Prateek Jain

Microsoft Research Lab, India

Microsoft Research Lab, India
View Profile

,
Daniel Perelman

Microsoft, USA

Microsoft, USA
View Profile

,
Oleksandr Polozov

Microsoft Research, USA

Microsoft Research, USA
View Profile

,
Sumit Gulwani

Microsoft, USA

Microsoft, USA
View Profile

,
Todd Millstein

University of California at Los Angeles, USA

University of California at Los Angeles, USA
View Profile

Proceedings of the ACM on Programming Languages Volume 2 Issue OOPSLAArticle No.: 150pp 1–28https://doi.org/10.1145/3276520

Published:24 October 2018Publication History

Related Artifact: Replication Package for Article: "FlashProfile: A Framework for Synthesizing Data Profiles" September 2018 software https://doi.org/10.1145/3276936

Proceedings of the ACM on Programming Languages

Abstract

We address the problem of learning a syntactic profile for a collection of strings, i.e. a set of regex-like patterns that succinctly describe the syntactic variations in the strings. Real-world datasets, typically curated from multiple sources, often contain data in various syntactic formats. Thus, any data processing task is preceded by the critical step of data format identification. However, manual inspection of data to identify the different formats is infeasible in standard big-data scenarios.

Prior techniques are restricted to a small set of pre-defined patterns (e.g. digits, letters, words etc.), and provide no control over granularity of profiles. We define syntactic profiling as a problem of clustering strings based on syntactic similarity, followed by identifying patterns that succinctly describe each cluster. We present a technique for synthesizing such profiles over a given language of patterns, that also allows for interactive refinement by requesting a desired number of clusters.

Using a state-of-the-art inductive synthesis framework, PROSE, we have implemented our technique as FlashProfile. Across 153 tasks over 75 large real datasets, we observe a median profiling time of only ∼ 0.7s. Furthermore, we show that access to syntactic profiles may allow for more accurate synthesis of programs, i.e. using fewer examples, in programming-by-example (PBE) workflows such as Flash Fill.

Supplemental Material

a150-padhi.webm

webm

86.9 MB

Download

References

Ziawasch Abedjan, Lukasz Golab, and Felix Naumann. 2015. Profiling Relational Data: A Survey. VLDB J. 24, 4 (2015), 557–581. Google ScholarDigital Library
Dana Angluin. 1987. Learning Regular Sets from Queries and Counterexamples. Inf. Comput. 75, 2 (1987), 87–106. Google ScholarDigital Library
David Arthur and Sergei Vassilvitskii. 2007. k-means++: The Advantages of Careful Seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2007, New Orleans, Louisiana, USA, January 7-9, 2007, Nikhil Bansal, Kirk Pruhs, and Clifford Stein (Eds.). SIAM, 1027–1035. http://dl.acm.org/citation.cfm?id=1283383.1283494 Google ScholarDigital Library
Ataccama. 2017. Ataccama One Platform. https://www.ataccama.com/ .Google Scholar
Alberto Bartoli, Giorgio Davanzo, Andrea De Lorenzo, Marco Mauri, Eric Medvet, and Enrico Sorio. 2012. Automatic Generation of Regular Expressions from Examples with Genetic Programming. In Genetic and Evolutionary Computation Conference, GECCO ’12, Philadelphia, PA, USA, July 7-11, 2012, Companion Material Proceedings, Terence Soule and Jason H. Moore (Eds.). ACM, 1477–1478. Google ScholarDigital Library
Osbert Bastani, Rahul Sharma, Alex Aiken, and Percy Liang. 2017. Synthesizing program input grammars. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2017, Barcelona, Spain, June 18-23, 2017, Albert Cohen and Martin T. Vechev (Eds.). ACM, 95–110. Google ScholarDigital Library
Arka Aloke Bhattacharya, Dezhi Hong, David E. Culler, Jorge Ortiz, Kamin Whitehouse, and Eugene Wu. 2015. Automated Metadata Construction to Support Portable Building Applications. In Proceedings of the 2nd ACM International Conference on Embedded Systems for Energy-Efficient Built Environments, BuildSys 2015, Seoul, South Korea, November 4-5, 2015, David Culler, Yuvraj Agarwal, and Rahul Mangharam (Eds.). ACM, 3–12. Google ScholarDigital Library
Christopher M. Bishop. 2016. Pattern Recognition and Machine Learning. Springer New York. http://www.worldcat.org/ oclc/1005113608Google Scholar
Leo Breiman. 2001. Random Forests. Machine Learning 45, 1 (2001), 5–32. Google ScholarDigital Library
Graham Cormode, Minos N. Garofalakis, Peter J. Haas, and Chris Jermaine. 2012. Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches. Foundations and Trends in Databases 4, 1-3 (2012), 1–294. Google ScholarDigital Library
Colin De la Higuera. 2010. Grammatical inference: learning automata and grammars. Cambridge University Press. Google ScholarDigital Library
Xin Luna Dong and Divesh Srivastava. 2013. Big Data Integration. PVLDB 6, 11 (2013), 1188–1189. http://www.vldb.org/ pvldb/vol6/p1188- srivastava.pdf Google ScholarDigital Library
Kevin Ellis and Sumit Gulwani. 2017. Learning to Learn Programs from Examples: Going Beyond Program Structure. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, August 19-25, 2017, Carles Sierra (Ed.). ijcai.org, 1638–1645. Google ScholarDigital Library
Kathleen Fisher, David Walker, and Kenny Qili Zhu. 2008. LearnPADS: automatic tool generation from ad hoc data. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2008, Vancouver, BC, Canada, June 10-12, 2008, Jason Tsong-Li Wang (Ed.). ACM, 1299–1302. Google ScholarDigital Library
Wael H Gomaa and Aly A Fahmy. 2013. A survey of text similarity approaches. International Journal of Computer Applications 68, 13 (April 2013), 13–18.Google ScholarCross Ref
Google. 2017. OpenRefine: A free, open source, powerful tool for working with messy data. http://openrefine.org/ .Google Scholar
Ronald L. Graham, Donald E. Knuth, and Oren Patashnik. 1994. Concrete Mathematics - A Foundation for Computer Science, 2nd Edition). Addison-Wesley. http://www.worldcat.org/oclc/992331503 Google ScholarDigital Library
Sumit Gulwani. 2011. Automating String Processing in Spreadsheets Using Input-Output Examples. In Proceedings of the 38th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2011, Austin, TX, USA, January 26-28, 2011. 317–330. Google ScholarDigital Library
Sumit Gulwani, Oleksandr Polozov, and Rishabh Singh. 2017. Program Synthesis. Foundations and Trends in Programming Languages 4, 1-2 (2017), 1–119.Google ScholarCross Ref
Peter J. Haas, Jeffrey F. Naughton, S. Seshadri, and Lynne Stokes. 1995. Sampling-Based Estimation of the Number of Distinct Values of an Attribute. In VLDB’95, Proceedings of 21th International Conference on Very Large Data Bases, September 11-15, 1995, Zurich, Switzerland., Umeshwar Dayal, Peter M. D. Gray, and Shojiro Nishio (Eds.). Morgan Kaufmann, 311–322. http://www.vldb.org/conf/1995/P311.PDF Google ScholarDigital Library
Maria Halkidi, Yannis Batistakis, and Michalis Vazirgiannis. 2001. On Clustering Validation Techniques. J. Intell. Inf. Syst. 17, 2-3 (2001), 107–145. Google ScholarDigital Library
Steve Hanneke. 2014. Theory of Disagreement-Based Active Learning. Found. Trends Mach. Learn. 7, 2-3 (June 2014), 131–309. Google ScholarDigital Library
Yannis E. Ioannidis. 2003. The History of Histograms (abridged). In VLDB 2003, Proceedings of 29th International Conference on Very Large Data Bases, September 9-12, 2003, Berlin, Germany, Johann Christoph Freytag, Peter C. Lockemann, Serge Abiteboul, Michael J. Carey, Patricia G. Selinger, and Andreas Heuer (Eds.). Morgan Kaufmann, 19–30. http: //www.vldb.org/conf/2003/papers/S02P01.pdf Google ScholarDigital Library
Anil K. Jain, M. Narasimha Murty, and Patrick J. Flynn. 1999. Data Clustering: A Review. ACM Comput. Surv. 31, 3 (1999), 264–323. Google ScholarDigital Library
Panagiotis Karras and Nikos Mamoulis. 2007. The Haar+ Tree: A Refined Synopsis Data Structure. In Proceedings of the 23rd International Conference on Data Engineering, ICDE 2007, The Marmara Hotel, Istanbul, Turkey, April 15-20, 2007, Rada Chirkova, Asuman Dogac, M. Tamer Özsu, and Timos K. Sellis (Eds.). IEEE Computer Society, 436–445.Google ScholarCross Ref
Dileep Kini and Sumit Gulwani. 2015. FlashNormalize: Programming by Examples for Text Normalization. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25-31, 2015, Qiang Yang and Michael Wooldridge (Eds.). AAAI Press, 776–783. http://ijcai.org/Abstract/15/115 Google ScholarDigital Library
Tessa Lau. 2009. Why Programming-By-Demonstration Systems Fail: Lessons Learned for Usable AI. AI Magazine 30, 4 (2009), 65–67.Google ScholarDigital Library
Vu Le and Sumit Gulwani. 2014. FlashExtract: a framework for data extraction by examples. In ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’14, Edinburgh, United Kingdom - June 09 - 11, 2014, Michael F. P. O’Boyle and Keshav Pingali (Eds.). ACM, 542–553. Google ScholarDigital Library
Vladimir I Levenshtein. 1966. Binary Codes Capable of Correcting Deletions, Insertions, and Reversals. In Soviet Physics Doklady, Vol. 10. 707–710. http://adsabs.harvard.edu/abs/1966SPhD...10..707LGoogle Scholar
Yunyao Li, Rajasekar Krishnamurthy, Sriram Raghavan, Shivakumar Vaithyanathan, and H. V. Jagadish. 2008. Regular Expression Learning for Information Extraction. In 2008 Conference on Empirical Methods in Natural Language Processing, EMNLP 2008, Proceedings of the Conference, 25-27 October 2008, Honolulu, Hawaii, USA, A meeting of SIGDAT, a Special Interest Group of the ACL. ACL, 21–30. http://www.aclweb.org/anthology/D08- 1003 Google ScholarDigital Library
Henry Lieberman. 2001. Your wish is my command: Programming by example. Morgan Kaufmann.Google Scholar
Steve Lohr. 2014. For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights. New York Times 17 (2014). https: //www.nytimes.com/2014/08/18/technology/for- big- data- scientists- hurdle- to- insights- is- janitor- work.htmlGoogle Scholar
James MacQueen et al. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, Vol. 1. Oakland, CA, USA., 281–297.Google Scholar
Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to information retrieval. Cambridge University Press. Google ScholarDigital Library
Arkady Maydanchik. 2007. Data Quality Assessment. Technics Publications. https://technicspub.com/ data- quality- assessment/ Google ScholarDigital Library
Mikaël Mayer, Gustavo Soares, Maxim Grechkin, Vu Le, Mark Marron, Oleksandr Polozov, Rishabh Singh, Benjamin G. Zorn, and Sumit Gulwani. 2015. User Interaction Models for Disambiguation in Programming by Example. In Proceedings of the 28th Annual ACM Symposium on User Interface Software & Technology, UIST 2015, Charlotte, NC, USA, November 8-11, 2015, Celine Latulipe, Bjoern Hartmann, and Tovi Grossman (Eds.). ACM, 291–301. Google ScholarDigital Library
Microsoft. 2017a. Azure Machine Learning By-Example Data Transform. https://www.youtube.com/watch?v=9KG0Sc2B2KI .Google Scholar
Microsoft. 2017b. Data Transformations "By Example" in the Azure ML Workbench. https://blogs.technet.microsoft.com/ machinelearning/2017/09/25/by- example- transformations- in- the- azure- machine- learning- workbench/ .Google Scholar
Microsoft. 2017c. Microsoft SQL Server Data Tools (SSDT). https://docs.microsoft.com/en- gb/sql/ssdt .Google Scholar
Microsoft. 2017d. Program Synthesis using Examples SDK. https://microsoft.github.io/prose/ .Google Scholar
José Oncina and Pedro García. 1992. Identifying regular languages in polynomial time. Advances in Structural and Syntactic Pattern Recognition 5, 99-108 (1992), 15–20.Google ScholarCross Ref
Oleksandr Polozov and Sumit Gulwani. 2015. FlashMeta: A Framework for Inductive Program Synthesis. In Proceedings of the 2015 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA 2015, part of SPLASH 2015, Pittsburgh, PA, USA, October 25-30, 2015, Jonathan Aldrich and Patrick Eugster (Eds.). ACM, 107–126. Google ScholarDigital Library
Vijayshankar Raman and Joseph M. Hellerstein. 2001. Potter’s Wheel: An Interactive Data Cleaning System. In VLDB 2001, Proceedings of 27th International Conference on Very Large Data Bases, September 11-14, 2001, Roma, Italy, Peter M. G. Apers, Paolo Atzeni, Stefano Ceri, Stefano Paraboschi, Kotagiri Ramamohanarao, and Richard T. Snodgrass (Eds.). Morgan Kaufmann, 381–390. http://www.vldb.org/conf/2001/P381.pdf Google ScholarDigital Library
Veselin Raychev, Pavol Bielik, Martin T. Vechev, and Andreas Krause. 2016. Learning programs from noisy data. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2016, St. Petersburg, FL, USA, January 20 - 22, 2016, Rastislav Bodík and Rupak Majumdar (Eds.). ACM, 761–774. Google ScholarDigital Library
Rishabh Singh. 2016. BlinkFill: Semi-supervised Programming By Example for Syntactic String Transformations. PVLDB 9, 10 (2016), 816–827. http://www.vldb.org/pvldb/vol9/p816- singh.pdf Google ScholarDigital Library
Armando Solar-Lezama, Liviu Tancau, Rastislav Bodík, Sanjit A. Seshia, and Vijay A. Saraswat. 2006. Combinatorial sketching for finite programs. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2006, San Jose, CA, USA, October 21-25, 2006, John Paul Shen and Margaret Martonosi (Eds.). ACM, 404–415. Google ScholarDigital Library
Thorvald Sørensen. 1948. A method of establishing groups of equal amplitude in plant sociology based on similarity of species and its application to analyses of the vegetation on Danish commons. Biol. Skr. 5 (1948), 1–34.Google Scholar
Borge Svingen. 1998. Learning Regular Languages using Genetic Programming. Proc. Genetic Programming (1998), 374–376.Google Scholar
Andrei N Tikhonov. 1963. Solution of Incorrectly Formulated Problems and the Regularization Method. In Dokl. Akad. Nauk., Vol. 151. 1035–1038.Google Scholar
Trifacta. 2017. Trifacta Wrangler. https://www.trifacta.com/products/wrangler/ .Google Scholar
William E Winkler. 1999. The State of Record Linkage and Current Research Problems.Google Scholar
Ian H Witten, Eibe Frank, Mark A Hall, and Christopher J Pal. 2017. Data Mining: Practical Machine Learning Tools and Techniques, 4th Edition. Elsevier Science & Technology. http://www.worldcat.org/oclc/1007085077 Google ScholarDigital Library
Rui Xu and Donald C. Wunsch II. 2005. Survey of Clustering Algorithms. IEEE Trans. Neural Networks 16, 3 (2005), 645–678. Google ScholarDigital Library
Kenny Qili Zhu, Kathleen Fisher, and David Walker. 2012. LearnPADS++ : Incremental Inference of Ad Hoc Data Formats. In Practical Aspects of Declarative Languages - 14th International Symposium, PADL 2012, Philadelphia, PA, USA, January 23-24, 2012. Proceedings (Lecture Notes in Computer Science), Claudio V. Russo and Neng-Fa Zhou (Eds.), Vol. 7149. Springer, 168–182. Google ScholarDigital Library

Index Terms

FlashProfile: a framework for synthesizing data profiles

Recommendations

Synthesizing transformations on hierarchically structured data
PLDI '16

This paper presents a new approach for synthesizing transformations on tree-structured data, such as Unix directories and XML documents. We consider a general abstraction for such data, called hierarchical data trees (HDTs) and present a novel example-...
Read More
Data Profiling: A Tutorial
SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Data

is to understand the dataset at hand and its metadata. The process of metadata discovery is known as data profiling. Profiling activities range from ad-hoc approaches, such as eye-balling random subsets of the data or formulating aggregation queries, to ...
Read More
Synthesizing data structure manipulations from storyboards
ESEC/FSE '11: Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on Foundations of software engineering

We present the Storyboard Programming framework, a new synthesis system designed to help programmers write imperative low-level data-structure manipulations. The goal of this system is to bridge the gap between the "boxes-and-arrows" diagrams that ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

Proceedings of the ACM on Programming Languages Volume 2, Issue OOPSLA
November 2018
1656 pages
EISSN:2475-1421
DOI:10.1145/3288538
Issue’s Table of Contents

Copyright © 2018 Owner/Author
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs International 4.0 License.
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 24 October 2018
Published in pacmpl Volume 2, Issue OOPSLA

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Badges
- Artifacts Available
- Artifacts Evaluated & Functional
Author Tags
data profiling
hierarchical clustering
outlier detection
pattern learning
pattern profiles
program synthesis
Qualifiers
- research-article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 13
  Total Citations
  View Citations
- 750
  Total Downloads
- Downloads (Last 12 months)139
- Downloads (Last 6 weeks)19
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

FlashProfile: a framework for synthesizing data profiles

Proceedings of the ACM on Programming Languages

Abstract

Supplemental Material

References

Cited By

Index Terms

Recommendations

Synthesizing transformations on hierarchically structured data

Data Profiling: A Tutorial

Synthesizing data structure manipulations from storyboards