ABSTRACT
We have developed an unsupervised framework for simultaneously extracting and normalizing product attributes from multiple Web pages originating from different sites. Our framework is based on a probabilistic graphical model that captures both the page-independent content information and the page-dependent layout information of the text fragments in Web pages. One characteristic of our framework is that previously unseen attributes can be discovered from clues contained in the layout format of the text fragments. Our framework tackles the extraction and normalization tasks together by jointly considering the relationship between the content and layout information. A Dirichlet process prior is employed, which has the further advantage that the number of discovered product attributes is unbounded. An unsupervised inference algorithm based on a variational method is presented. The semantics of the normalized attributes can be visualized by examining the term weights in the model. Our framework can be applied to a wide range of Web mining applications such as product matching and retrieval. We have conducted extensive experiments on four different domains consisting of over 300 Web pages from over 150 different Web sites, demonstrating the robustness and effectiveness of our framework.
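The remark that a Dirichlet process prior leaves the number of attributes unbounded can be illustrated with the stick-breaking construction of the process. What follows is a minimal sketch and not the model described above: the function name, the concentration value `alpha`, and the truncation level are assumptions chosen only for illustration. Truncating the stick at a finite level is a common device in variational treatments of Dirichlet process mixtures.

```python
import random


def stick_breaking_weights(alpha, truncation, rng):
    """Draw mixture weights from a truncated stick-breaking construction.

    alpha      -- concentration parameter (hypothetical value chosen by the caller)
    truncation -- number of components kept; an approximation, since the
                  untruncated process has infinitely many components
    rng        -- a random.Random instance, passed in for reproducibility
    """
    weights = []
    remaining = 1.0  # length of the stick that has not been assigned yet
    for _ in range(truncation - 1):
        v = rng.betavariate(1.0, alpha)  # fraction broken off the remaining stick
        weights.append(remaining * v)
        remaining *= 1.0 - v
    weights.append(remaining)  # the last component absorbs the leftover mass
    return weights


rng = random.Random(0)
weights = stick_breaking_weights(alpha=2.0, truncation=20, rng=rng)
```

Each weight can be read as the prior share of one latent attribute. Because the stick can always be broken further, raising the truncation level adds components without changing the construction, which is the sense in which the number of attributes is not fixed in advance.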