skip to main content
research-article
Open Access
Artifacts Available
Artifacts Evaluated & Functional

FlashProfile: a framework for synthesizing data profiles

Published:24 October 2018Publication History
Skip Abstract Section

Abstract

We address the problem of learning a syntactic profile for a collection of strings, i.e. a set of regex-like patterns that succinctly describe the syntactic variations in the strings. Real-world datasets, typically curated from multiple sources, often contain data in various syntactic formats. Thus, any data processing task is preceded by the critical step of data format identification. However, manual inspection of data to identify the different formats is infeasible in standard big-data scenarios.

Prior techniques are restricted to a small set of pre-defined patterns (e.g. digits, letters, words etc.), and provide no control over granularity of profiles. We define syntactic profiling as a problem of clustering strings based on syntactic similarity, followed by identifying patterns that succinctly describe each cluster. We present a technique for synthesizing such profiles over a given language of patterns, that also allows for interactive refinement by requesting a desired number of clusters.

Using a state-of-the-art inductive synthesis framework, PROSE, we have implemented our technique as FlashProfile. Across 153 tasks over 75 large real datasets, we observe a median profiling time of only ∼ 0.7s. Furthermore, we show that access to syntactic profiles may allow for more accurate synthesis of programs, i.e. using fewer examples, in programming-by-example (PBE) workflows such as Flash Fill.

Skip Supplemental Material Section

Supplemental Material

a150-padhi.webm

webm

86.9 MB

References

  1. Ziawasch Abedjan, Lukasz Golab, and Felix Naumann. 2015. Profiling Relational Data: A Survey. VLDB J. 24, 4 (2015), 557–581. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Dana Angluin. 1987. Learning Regular Sets from Queries and Counterexamples. Inf. Comput. 75, 2 (1987), 87–106. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. David Arthur and Sergei Vassilvitskii. 2007. k-means++: The Advantages of Careful Seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2007, New Orleans, Louisiana, USA, January 7-9, 2007, Nikhil Bansal, Kirk Pruhs, and Clifford Stein (Eds.). SIAM, 1027–1035. http://dl.acm.org/citation.cfm?id=1283383.1283494 Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Ataccama. 2017. Ataccama One Platform. https://www.ataccama.com/ .Google ScholarGoogle Scholar
  5. Alberto Bartoli, Giorgio Davanzo, Andrea De Lorenzo, Marco Mauri, Eric Medvet, and Enrico Sorio. 2012. Automatic Generation of Regular Expressions from Examples with Genetic Programming. In Genetic and Evolutionary Computation Conference, GECCO ’12, Philadelphia, PA, USA, July 7-11, 2012, Companion Material Proceedings, Terence Soule and Jason H. Moore (Eds.). ACM, 1477–1478. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Osbert Bastani, Rahul Sharma, Alex Aiken, and Percy Liang. 2017. Synthesizing program input grammars. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2017, Barcelona, Spain, June 18-23, 2017, Albert Cohen and Martin T. Vechev (Eds.). ACM, 95–110. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Arka Aloke Bhattacharya, Dezhi Hong, David E. Culler, Jorge Ortiz, Kamin Whitehouse, and Eugene Wu. 2015. Automated Metadata Construction to Support Portable Building Applications. In Proceedings of the 2nd ACM International Conference on Embedded Systems for Energy-Efficient Built Environments, BuildSys 2015, Seoul, South Korea, November 4-5, 2015, David Culler, Yuvraj Agarwal, and Rahul Mangharam (Eds.). ACM, 3–12. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Christopher M. Bishop. 2016. Pattern Recognition and Machine Learning. Springer New York. http://www.worldcat.org/ oclc/1005113608Google ScholarGoogle Scholar
  9. Leo Breiman. 2001. Random Forests. Machine Learning 45, 1 (2001), 5–32. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Graham Cormode, Minos N. Garofalakis, Peter J. Haas, and Chris Jermaine. 2012. Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches. Foundations and Trends in Databases 4, 1-3 (2012), 1–294. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Colin De la Higuera. 2010. Grammatical inference: learning automata and grammars. Cambridge University Press. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Xin Luna Dong and Divesh Srivastava. 2013. Big Data Integration. PVLDB 6, 11 (2013), 1188–1189. http://www.vldb.org/ pvldb/vol6/p1188- srivastava.pdf Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Kevin Ellis and Sumit Gulwani. 2017. Learning to Learn Programs from Examples: Going Beyond Program Structure. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, August 19-25, 2017, Carles Sierra (Ed.). ijcai.org, 1638–1645. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Kathleen Fisher, David Walker, and Kenny Qili Zhu. 2008. LearnPADS: automatic tool generation from ad hoc data. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2008, Vancouver, BC, Canada, June 10-12, 2008, Jason Tsong-Li Wang (Ed.). ACM, 1299–1302. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Wael H Gomaa and Aly A Fahmy. 2013. A survey of text similarity approaches. International Journal of Computer Applications 68, 13 (April 2013), 13–18.Google ScholarGoogle ScholarCross RefCross Ref
  16. Google. 2017. OpenRefine: A free, open source, powerful tool for working with messy data. http://openrefine.org/ .Google ScholarGoogle Scholar
  17. Ronald L. Graham, Donald E. Knuth, and Oren Patashnik. 1994. Concrete Mathematics - A Foundation for Computer Science, 2nd Edition). Addison-Wesley. http://www.worldcat.org/oclc/992331503 Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Sumit Gulwani. 2011. Automating String Processing in Spreadsheets Using Input-Output Examples. In Proceedings of the 38th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2011, Austin, TX, USA, January 26-28, 2011. 317–330. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Sumit Gulwani, Oleksandr Polozov, and Rishabh Singh. 2017. Program Synthesis. Foundations and Trends in Programming Languages 4, 1-2 (2017), 1–119.Google ScholarGoogle ScholarCross RefCross Ref
  20. Peter J. Haas, Jeffrey F. Naughton, S. Seshadri, and Lynne Stokes. 1995. Sampling-Based Estimation of the Number of Distinct Values of an Attribute. In VLDB’95, Proceedings of 21th International Conference on Very Large Data Bases, September 11-15, 1995, Zurich, Switzerland., Umeshwar Dayal, Peter M. D. Gray, and Shojiro Nishio (Eds.). Morgan Kaufmann, 311–322. http://www.vldb.org/conf/1995/P311.PDF Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Maria Halkidi, Yannis Batistakis, and Michalis Vazirgiannis. 2001. On Clustering Validation Techniques. J. Intell. Inf. Syst. 17, 2-3 (2001), 107–145. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Steve Hanneke. 2014. Theory of Disagreement-Based Active Learning. Found. Trends Mach. Learn. 7, 2-3 (June 2014), 131–309. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Yannis E. Ioannidis. 2003. The History of Histograms (abridged). In VLDB 2003, Proceedings of 29th International Conference on Very Large Data Bases, September 9-12, 2003, Berlin, Germany, Johann Christoph Freytag, Peter C. Lockemann, Serge Abiteboul, Michael J. Carey, Patricia G. Selinger, and Andreas Heuer (Eds.). Morgan Kaufmann, 19–30. http: //www.vldb.org/conf/2003/papers/S02P01.pdf Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Anil K. Jain, M. Narasimha Murty, and Patrick J. Flynn. 1999. Data Clustering: A Review. ACM Comput. Surv. 31, 3 (1999), 264–323. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Panagiotis Karras and Nikos Mamoulis. 2007. The Haar+ Tree: A Refined Synopsis Data Structure. In Proceedings of the 23rd International Conference on Data Engineering, ICDE 2007, The Marmara Hotel, Istanbul, Turkey, April 15-20, 2007, Rada Chirkova, Asuman Dogac, M. Tamer Özsu, and Timos K. Sellis (Eds.). IEEE Computer Society, 436–445.Google ScholarGoogle ScholarCross RefCross Ref
  26. Dileep Kini and Sumit Gulwani. 2015. FlashNormalize: Programming by Examples for Text Normalization. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25-31, 2015, Qiang Yang and Michael Wooldridge (Eds.). AAAI Press, 776–783. http://ijcai.org/Abstract/15/115 Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Tessa Lau. 2009. Why Programming-By-Demonstration Systems Fail: Lessons Learned for Usable AI. AI Magazine 30, 4 (2009), 65–67.Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Vu Le and Sumit Gulwani. 2014. FlashExtract: a framework for data extraction by examples. In ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’14, Edinburgh, United Kingdom - June 09 - 11, 2014, Michael F. P. O’Boyle and Keshav Pingali (Eds.). ACM, 542–553. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Vladimir I Levenshtein. 1966. Binary Codes Capable of Correcting Deletions, Insertions, and Reversals. In Soviet Physics Doklady, Vol. 10. 707–710. http://adsabs.harvard.edu/abs/1966SPhD...10..707LGoogle ScholarGoogle Scholar
  30. Yunyao Li, Rajasekar Krishnamurthy, Sriram Raghavan, Shivakumar Vaithyanathan, and H. V. Jagadish. 2008. Regular Expression Learning for Information Extraction. In 2008 Conference on Empirical Methods in Natural Language Processing, EMNLP 2008, Proceedings of the Conference, 25-27 October 2008, Honolulu, Hawaii, USA, A meeting of SIGDAT, a Special Interest Group of the ACL. ACL, 21–30. http://www.aclweb.org/anthology/D08- 1003 Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Henry Lieberman. 2001. Your wish is my command: Programming by example. Morgan Kaufmann.Google ScholarGoogle Scholar
  32. Steve Lohr. 2014. For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights. New York Times 17 (2014). https: //www.nytimes.com/2014/08/18/technology/for- big- data- scientists- hurdle- to- insights- is- janitor- work.htmlGoogle ScholarGoogle Scholar
  33. James MacQueen et al. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, Vol. 1. Oakland, CA, USA., 281–297.Google ScholarGoogle Scholar
  34. Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to information retrieval. Cambridge University Press. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Arkady Maydanchik. 2007. Data Quality Assessment. Technics Publications. https://technicspub.com/ data- quality- assessment/ Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Mikaël Mayer, Gustavo Soares, Maxim Grechkin, Vu Le, Mark Marron, Oleksandr Polozov, Rishabh Singh, Benjamin G. Zorn, and Sumit Gulwani. 2015. User Interaction Models for Disambiguation in Programming by Example. In Proceedings of the 28th Annual ACM Symposium on User Interface Software & Technology, UIST 2015, Charlotte, NC, USA, November 8-11, 2015, Celine Latulipe, Bjoern Hartmann, and Tovi Grossman (Eds.). ACM, 291–301. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Microsoft. 2017a. Azure Machine Learning By-Example Data Transform. https://www.youtube.com/watch?v=9KG0Sc2B2KI .Google ScholarGoogle Scholar
  38. Microsoft. 2017b. Data Transformations "By Example" in the Azure ML Workbench. https://blogs.technet.microsoft.com/ machinelearning/2017/09/25/by- example- transformations- in- the- azure- machine- learning- workbench/ .Google ScholarGoogle Scholar
  39. Microsoft. 2017c. Microsoft SQL Server Data Tools (SSDT). https://docs.microsoft.com/en- gb/sql/ssdt .Google ScholarGoogle Scholar
  40. Microsoft. 2017d. Program Synthesis using Examples SDK. https://microsoft.github.io/prose/ .Google ScholarGoogle Scholar
  41. José Oncina and Pedro García. 1992. Identifying regular languages in polynomial time. Advances in Structural and Syntactic Pattern Recognition 5, 99-108 (1992), 15–20.Google ScholarGoogle ScholarCross RefCross Ref
  42. Oleksandr Polozov and Sumit Gulwani. 2015. FlashMeta: A Framework for Inductive Program Synthesis. In Proceedings of the 2015 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA 2015, part of SPLASH 2015, Pittsburgh, PA, USA, October 25-30, 2015, Jonathan Aldrich and Patrick Eugster (Eds.). ACM, 107–126. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. Vijayshankar Raman and Joseph M. Hellerstein. 2001. Potter’s Wheel: An Interactive Data Cleaning System. In VLDB 2001, Proceedings of 27th International Conference on Very Large Data Bases, September 11-14, 2001, Roma, Italy, Peter M. G. Apers, Paolo Atzeni, Stefano Ceri, Stefano Paraboschi, Kotagiri Ramamohanarao, and Richard T. Snodgrass (Eds.). Morgan Kaufmann, 381–390. http://www.vldb.org/conf/2001/P381.pdf Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. Veselin Raychev, Pavol Bielik, Martin T. Vechev, and Andreas Krause. 2016. Learning programs from noisy data. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2016, St. Petersburg, FL, USA, January 20 - 22, 2016, Rastislav Bodík and Rupak Majumdar (Eds.). ACM, 761–774. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. Rishabh Singh. 2016. BlinkFill: Semi-supervised Programming By Example for Syntactic String Transformations. PVLDB 9, 10 (2016), 816–827. http://www.vldb.org/pvldb/vol9/p816- singh.pdf Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. Armando Solar-Lezama, Liviu Tancau, Rastislav Bodík, Sanjit A. Seshia, and Vijay A. Saraswat. 2006. Combinatorial sketching for finite programs. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2006, San Jose, CA, USA, October 21-25, 2006, John Paul Shen and Margaret Martonosi (Eds.). ACM, 404–415. Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. Thorvald Sørensen. 1948. A method of establishing groups of equal amplitude in plant sociology based on similarity of species and its application to analyses of the vegetation on Danish commons. Biol. Skr. 5 (1948), 1–34.Google ScholarGoogle Scholar
  48. Borge Svingen. 1998. Learning Regular Languages using Genetic Programming. Proc. Genetic Programming (1998), 374–376.Google ScholarGoogle Scholar
  49. Andrei N Tikhonov. 1963. Solution of Incorrectly Formulated Problems and the Regularization Method. In Dokl. Akad. Nauk., Vol. 151. 1035–1038.Google ScholarGoogle Scholar
  50. Trifacta. 2017. Trifacta Wrangler. https://www.trifacta.com/products/wrangler/ .Google ScholarGoogle Scholar
  51. William E Winkler. 1999. The State of Record Linkage and Current Research Problems.Google ScholarGoogle Scholar
  52. Ian H Witten, Eibe Frank, Mark A Hall, and Christopher J Pal. 2017. Data Mining: Practical Machine Learning Tools and Techniques, 4th Edition. Elsevier Science & Technology. http://www.worldcat.org/oclc/1007085077 Google ScholarGoogle ScholarDigital LibraryDigital Library
  53. Rui Xu and Donald C. Wunsch II. 2005. Survey of Clustering Algorithms. IEEE Trans. Neural Networks 16, 3 (2005), 645–678. Google ScholarGoogle ScholarDigital LibraryDigital Library
  54. Kenny Qili Zhu, Kathleen Fisher, and David Walker. 2012. LearnPADS++ : Incremental Inference of Ad Hoc Data Formats. In Practical Aspects of Declarative Languages - 14th International Symposium, PADL 2012, Philadelphia, PA, USA, January 23-24, 2012. Proceedings (Lecture Notes in Computer Science), Claudio V. Russo and Neng-Fa Zhou (Eds.), Vol. 7149. Springer, 168–182. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. FlashProfile: a framework for synthesizing data profiles

            Recommendations

            Comments

            Login options

            Check if you have access through your login credentials or your institution to get full access on this article.

            Sign in

            Full Access

            PDF Format

            View or Download as a PDF file.

            PDF

            eReader

            View online with eReader.

            eReader