research-article

An unsupervised framework for extracting and normalizing product attributes from multiple web sites

Authors:

Tik-Shun Wong

SIGIR '08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval

Pages 35 - 42

https://doi.org/10.1145/1390334.1390343

Published: 20 July 2008

Abstract

We have developed an unsupervised framework for simultaneously extracting and normalizing product attributes from multiple Web pages originating from different sites. Our framework is based on a probabilistic graphical model that captures both the page-independent content information and the page-dependent layout information of the text fragments in Web pages. One characteristic of our framework is that previously unseen attributes can be discovered from clues contained in the layout format of the text fragments. Our framework tackles the extraction and normalization tasks together by jointly considering the relationship between content and layout information. A Dirichlet process prior is employed, which brings the further advantage that the number of discovered product attributes is unbounded. An unsupervised inference algorithm based on a variational method is presented. The semantics of the normalized attributes can be visualized by examining the term weights in the model. Our framework can be applied to a wide range of Web mining applications such as product matching and retrieval. We have conducted extensive experiments on four different domains, covering over 300 Web pages from over 150 different Web sites, which demonstrate the robustness and effectiveness of our framework.
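
The abstract notes that a Dirichlet process prior leaves the number of attributes unbounded. A minimal sketch of the standard stick-breaking construction may help make that concrete. It is not the authors' model or inference algorithm: the function name `stick_breaking_weights`, the concentration `alpha`, and the truncation level are illustrative assumptions, and the truncation is only the finite cut-off commonly used to make variational inference tractable.

```python
import random


def stick_breaking_weights(alpha, truncation, seed=0):
    """Draw mixture weights from a truncated stick-breaking construction.

    alpha      -- concentration parameter (hypothetical value chosen by the caller)
    truncation -- number of components kept; an approximation, not a model limit
    """
    rng = random.Random(seed)
    weights = []
    remaining = 1.0  # length of the stick not yet assigned to any component
    for _ in range(truncation - 1):
        # Break off a Beta(1, alpha) fraction of what is left of the stick.
        fraction = rng.betavariate(1.0, alpha)
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    # Give the leftover mass to the last component so the weights sum to 1.
    weights.append(remaining)
    return weights


# Each weight can be read as the prior share of one candidate attribute.
weights = stick_breaking_weights(alpha=2.0, truncation=20)
```

Raising `truncation` adds further candidate components with progressively smaller expected weight, which is the sense in which the prior places no fixed cap on the number of attributes.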

Published In

SIGIR '08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval

July 2008

934 pages

ISBN:9781605581644

DOI:10.1145/1390334

General Chairs:
Tat-Seng Chua, National University of Singapore
Mun-Kew Leong, National Library Board, Singapore

Program Chairs:
Syung Hyon Myaeng, Information and Communications University, Korea
Douglas W. Oard, University of Maryland, College Park, USA
Fabrizio Sebastiani, Consiglio Nazionale delle Ricerche, Italy

Copyright © 2008 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SIGIR '08

SIGIR '08: The 31st Annual International ACM SIGIR Conference

July 20 - 24, 2008

Singapore, Singapore

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%
