skip to main content
10.1145/1055558.1055560acmconferencesArticle/Chapter ViewAbstractPublication PagespodsConference Proceedingsconference-collections
Article

The Lixto data extraction project: back and forth between theory and practice

Published: 14 June 2004 Publication History

Abstract

We present the Lixto project, which is both a research project in database theory and a commercial enterprise that develops Web data extraction (wrapping) and Web service definition software. We discuss the project's main motivations and ideas, in particular the use of a logic-based framework for wrapping. Then we present theoretical results on monadic datalog over trees and on Elog, its close relative which is used as the internal wrapper language in the Lixto system. These results include both a characterization of the expressive power and the complexity of these languages. We describe the visual wrapper specification process in Lixto and various practical aspects of wrapping. We discuss work on the complexity of query languages for trees that was inseminated by our theoretical study of logic-based languages for wrapping. Then we return to the practice of wrapping and the Lixto Transformation Server, which allows for streaming integration of data extracted from Web pages. This is a natural requirement in complex services based on Web wrapping. Finally, we discuss industrial applications of Lixto and point to open problems for future study.

References

[1]
S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, 1995.]]
[2]
R. Baumgartner, S. Eichholz, S. Flesca, G. Gottlob, and M. Herzog. Semantic Markup of News Items with Lixto, 2003.]]
[3]
R. Baumgartner, S. Flesca, and G. Gottlob. "Declarative Information Extraction, Web Crawling, and Recursive Wrapping with Lixto". In Proc. LPNMR'01, Vienna, Austria, 2001.]]
[4]
R. Baumgartner, S. Flesca, and G. Gottlob. "Visual Web Information Extraction with Lixto". In Proceedings of the 27th International Conference on Very Large Data Bases (VLDB'01), 2001.]]
[5]
R. Baumgartner, S. Flesca, G. Gottlob, and M. Herzog. "Building Dynamic Information Portals - A Case Study in the Agrarian Domain". In Proc. IS, 2002.]]
[6]
R. Baumgartner, M. Herzog, and G. Gottlob. "Visual Programming of Web Data Aggregation Applications". In Proc. IIWeb-03, 2003.]]
[7]
S. Cosmadakis, H. Gaifman, P. Kanellakis, and M. Vardi. "Decidable Optimization Problems for Database Logic Programs". In Proceedings of the 20th Annual ACM Symposium on Theory of Computing, pages 477--490, Chicago, Illinois, USA, 1988.]]
[8]
B. Courcelle. "Graph Rewriting: An Algebraic and Logic Approach". In J. van Leeuwen, editor, Handbook of Theoretical Computer Science, volume 2, chapter 5, pages 193--242. Elsevier Science Publishers B. V., 1990.]]
[9]
E. Dantsin, T. Eiter, G. Gottlob, and A. Voronkov. "Complexity and Expressive Power of Logic Programming". ACM Computing Surveys,33(3):374--425, Sept. 2001.]]
[10]
J. Doner. "Tree Acceptors and some of their Applications". Journal of Computer and System Sciences,4:406--451, 1970.]]
[11]
J. Flum, M. Frick, and M. Grohe. "Query Evaluation via Tree-Decompositions". In Proc. ICDT'01, volume 1973 of LNCS, pages 22--38. Springer, Jan. 2001.]]
[12]
M. Frick, M. Grohe, and C. Koch. "Query Evaluation on Compressed Trees". In Proc. LICS'03, Ottawa, Canada, June 2003.]]
[13]
E. Gold. "Language Identification in the Limit". Inform. Control, 10:447--474, 1967.]]
[14]
G. Gottlob and C. Koch. "Monadic Datalog and the Expressive Power of Web Information Extraction Languages". Journal of the ACM,51(1):74--113, 2004.]]
[15]
G. Gottlob, C. Koch, and R. Pichler. "Efficient Algorithms for Processing XPath Queries". In Proc. VLDB 2002, Hong Kong, China, 2002.]]
[16]
G. Gottlob, C. Koch, and R. Pichler. "The Complexity of XPath Query Processing". In Proc. PODS'03, 2003.]]
[17]
G. Gottlob, C. Koch, and R. Pichler. "XPath Query Evaluation: Improving Time and Space Efficiency". In ICDE'03, Bangalore, India, Mar. 2003.]]
[18]
G. Gottlob, C. Koch, and K. U. Schulz. Conjunctive Queries over Trees. In Proc. PODS'04, 2004.]]
[19]
R. Greenlaw, H. J. Hoover, and W. L. Ruzzo. Limits to Parallel Computation: P-Completeness Theory. Oxford University Press, 1995.]]
[20]
M. Herzog and G. Gottlob: "InfoPipes: A Flexible Framework for M-Commerce Applications". In Proc. TES, 2001.]]
[21]
C. Koch. "Efficient Processing of Expressive Node-Selecting Queries on XML Data in Secondary Storage: A Tree Automata-based Approach". In Proc. VLDB 2003, pages 249--260, 2003.]]
[22]
R. Kosala, H. Blockeel, M. Bruynooghe, and J. V. den Bussche. "Information Extraction from Web Documents based on Local Unranked Tree Automaton Inference". In Proc. IJCAI, 2003.]]
[23]
N. Kushmerick, D. Weld, and R. Doorenbos. "Wrapper Induction for Information Extraction". In Proc. IJCAI, 1997.]]
[24]
A. H. F. Laender, B. Ribeiro-Neto, and A. S. da Silva. "DEByE -- Data Extraction By Example". Data and Knowledge Engineering,40(2):121--154, Feb. 2002.]]
[25]
L. Liu, C. Pu, and W. Han. "XWRAP: An XML-Enabled Wrapper Construction System for Web Information Sources". In Proc. ICDE 2000, pages 611--621, San Diego, USA, 2000.]]
[26]
http://www.lixto.com.]]
[27]
B. Ludäscher, R. Himmeröder, G. Lausen, W. May, and C. Schlepphorst. "Managing Semistructured Data with Florid: A Deductive Object-oriented Perspective". Information Systems,23(8):1--25, 1998.]]
[28]
H. Meuss, K. U. Schulz, and F. Bry. "Towards Aggregated Answers for Semistructured Data". In Proc. ICDT'01, pages 346--360, 2001.]]
[29]
M. Minoux. "LTUR: A Simplified Linear-Time Unit Resolution Algorithm for Horn Formulae and Computer Implementation". Information Processing Letters,29(1):1--12, 1988.]]
[30]
Mostrare project. www.grappa.univ-lille3.fr/mostrare/.]]
[31]
I. Muslea, S. Minton, and C. Knoblock. "A Hierarchical Approach to Wrapper Induction". In Proc. 3rd Intern. Conf. on Autonomous Agents, 1999.]]
[32]
F. Neven and T. Schwentick. "Query Automata on Finite Trees". Theoretical Computer Science,275:633--674, 2002.]]
[33]
F. Neven and J. van den Bussche. "Expressiveness of Structured Document Query Languages Based on Attribute Grammars". Journal of the ACM, 49(1):56--100, Jan. 2002.]]
[34]
Y. Papakonstantinou, A. Gupta, H. Garcia-Molina, and J. Ullman. "A Query Translation Scheme for Rapid Implementation of Wrappers". In DOOD'95, pages 161--186, Singapore, 1995. Springer.]]
[35]
A. Sahuguet and F. Azavant. "Building Intelligent Web Applications Using Lightweight Wrappers". Data and Knowledge Engineering,36(3):283--316, 2001.]]
[36]
H. Seidl, T. Schwentick, and A. Muscholl. "Numerical Document Queries". In Proc. PODS'03, pages 155--166, San Diego, California, 2003.]]
[37]
J. Thatcher and J. Wright. "Generalized Finite Automata Theory with an Application to a Decision Problem of Second-order Logic". Mathematical Systems Theory,2(1):57--81, 1968.]]
[38]
W. Thomas. "Languages, Automata, and Logic". In G. Rozenberg and A. Salomaa, editors, Handbook of Formal Languages, volume 3, chapter 7, pages 389--455. Springer Verlag, 1997.]]
[39]
World Wide Web Consortium. XML Path Language (XPath) Recommendation. http://www.w3c.org/TR/xpath/, Nov. 1999.]]

Cited By

View all

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
PODS '04: Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
June 2004
350 pages
ISBN:158113858X
DOI:10.1145/1055558
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 14 June 2004

Permissions

Request permissions for this article.

Check for updates

Qualifiers

  • Article

Conference

SIGMOD/PODS04

Acceptance Rates

Overall Acceptance Rate 642 of 2,707 submissions, 24%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)13
  • Downloads (Last 6 weeks)0
Reflects downloads up to 30 Jan 2025

Other Metrics

Citations

Cited By

View all
  • (2022)Modular materialisation of Datalog programsArtificial Intelligence10.1016/j.artint.2022.103726308:COnline publication date: 1-Jul-2022
  • (2022)Adventures with Datalog: Walking the Thin Line Between Theory and PracticeAIxIA 2022 – Advances in Artificial Intelligence10.1007/978-3-031-27181-6_34(489-500)Online publication date: 28-Nov-2022
  • (2022)Automatic Web Data API Creation via Cross-Lingual Neural Pagination RecognitionWeb Engineering10.1007/978-3-031-09917-5_8(117-131)Online publication date: 5-Jul-2022
  • (2022)Datalog and Logic DatabasesundefinedOnline publication date: 25-Feb-2022
  • (2021)The smallest extraction problemProceedings of the VLDB Endowment10.14778/3476249.347629314:11(2445-2458)Online publication date: 27-Oct-2021
  • (2021)Developing and certifying Datalog optimizations in coq/mathcompProceedings of the 10th ACM SIGPLAN International Conference on Certified Programs and Proofs10.1145/3437992.3439913(163-177)Online publication date: 17-Jan-2021
  • (2019)Hybrid Crowd-Machine Wrapper InferenceACM Transactions on Knowledge Discovery from Data10.1145/334472013:5(1-43)Online publication date: 24-Sep-2019
  • (2018)Syntax-guided synthesis of Datalog programsProceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering10.1145/3236024.3236034(515-527)Online publication date: 26-Oct-2018
  • (2018)DatalogDeclarative Logic Programming10.1145/3191315.3191317(3-100)Online publication date: 1-Sep-2018
  • (2018)Browserless Web Data ExtractionProceedings of the 2018 World Wide Web Conference10.1145/3178876.3186008(1095-1104)Online publication date: 10-Apr-2018
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media