ABSTRACT
The World-Wide Web consists not only of a huge number of unstructured texts, but also of a vast amount of valuable structured data. Web tables [2] are a typical kind of structured information that is pervasive on the web, and web-scale methods that automatically extract web tables have been studied extensively [1]. Many powerful systems (e.g., OCTOPUS [4], Mesa [3]) use extracted web tables as a fundamental component.
In database terms, a table is a set of tuples that share the same attributes. Similarly, a web table is a set of rows (corresponding to database tuples) that share the same column headers (corresponding to database attributes). Extracting a web table therefore amounts to extracting a relation from the web. In databases, tables often contain foreign keys that refer to other tables. By analogy, hyperlinks inside a web table sometimes function as foreign keys to other relations whose tuples are contained in the hyperlinks' target pages. In this paper, we explore this idea by asking: can we discover new attributes for web tables by exploring the hyperlinks inside them?
This poster proposes a solution that takes a web table as input. By following the hyperlinks in the web table, it generates frequent patterns as candidate new relations. The confidence of each candidate is evaluated, and trustworthy candidates are selected to become new attributes for the table. Finally, we show the usefulness of our method through experiments on a variety of web domains.
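The abstract does not spell out the algorithm, so the following is only a minimal sketch of the pipeline it describes, under strong assumptions: the target page of each hyperlink is assumed to have been reduced already to label/value pairs, a "frequent pattern" is approximated by a label that occurs on many target pages (plain support counting, not pattern-growth mining), and support stands in for the confidence score. The names `discover_attributes`, `extend_table`, `linked_pages`, `min_support` and the `"link"` key are all hypothetical.

```python
from collections import Counter


def discover_attributes(rows, linked_pages, min_support=0.6):
    """Return {label: support} for labels found on enough linked pages.

    rows: list of dicts, each with a hypothetical "link" key (hyperlink target).
    linked_pages: dict mapping a URL to the label/value pairs on that page
                  (assumed to be extracted beforehand).
    """
    if not rows:
        return {}
    counts = Counter()
    for row in rows:
        page = linked_pages.get(row["link"], {})
        counts.update(page.keys())  # each label counts once per row
    total = len(rows)
    # Support (fraction of rows whose target page has the label) is used
    # here as a stand-in for the confidence of a candidate attribute.
    return {
        label: n / total
        for label, n in counts.items()
        if n / total >= min_support
    }


def extend_table(rows, linked_pages, min_support=0.6):
    """Add selected attributes to each row, treating the link as a foreign key."""
    selected = discover_attributes(rows, linked_pages, min_support)
    extended = []
    for row in rows:
        page = linked_pages.get(row["link"], {})
        new_row = dict(row)
        for label in selected:
            new_row[label] = page.get(label)  # None if the page lacks the label
        extended.append(new_row)
    return extended


# Toy input: a three-row web table whose rows link to detail pages.
rows = [
    {"title": "Film A", "link": "/film/a"},
    {"title": "Film B", "link": "/film/b"},
    {"title": "Film C", "link": "/film/c"},
]
linked_pages = {
    "/film/a": {"Director": "X", "Runtime": "90 min"},
    "/film/b": {"Director": "Y", "Runtime": "100 min", "Trivia": "..."},
    "/film/c": {"Director": "Z"},
}

for extended_row in extend_table(rows, linked_pages):
    print(extended_row)
```

In this toy input, "Director" and "Runtime" clear the 0.6 threshold and become new columns, while "Trivia" appears on only one of three pages and is discarded. A real system would need a more robust confidence measure than raw support.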
References
[1] G. Miao, J. Tatemura, W.-P. Hsiung, A. Sawires and L. E. Moser, Extracting data records from the web using tag path clustering, In WWW, pp. 981--990, 2009.
[2] M. J. Cafarella, A. Y. Halevy, D. Z. Wang, E. Wu and Y. Zhang, WebTables: exploring the power of tables on the web, In VLDB, pp. 538--549, 2008.
[3] S. Mergen, J. Freire and C. Heuser, Mesa: A Search Engine for Querying Web Tables, In SBBD, demo, 2008.
[4] M. J. Cafarella, A. Y. Halevy and N. Khoussainova, Data Integration for the Relational Web, In VLDB, pp. 1090--1101, 2009.
[5] J. Han and J. Pei, Mining Frequent Patterns by Pattern-Growth: Methodology and Implications, In SIGKDD Explorations, pp. 13--20, 2000.
[6] A. Yates, M. Banko, M. Broadhead, M. J. Cafarella, O. Etzioni and S. Soderland, TextRunner: Open Information Extraction on the Web, In HLT-NAACL, pp. 25--26, 2007.
[7] A. Culotta, A. McCallum and J. Betz, Integrating Probabilistic Extraction Models and Data Mining to Discover Relations and Patterns in Text, In HLT-NAACL, 2006.
Index Terms
- Entity relation discovery from web tables and links