DOI: 10.1145/3106426.3106449
research-article

Extracting attribute-value pairs from product specifications on the web

Published: 23 August 2017

Abstract

Comparison shopping portals integrate product offers from large numbers of e-shops in order to support consumers in their buying decisions. Product offers often consist of a title and a free-text product description, both describing product attributes that are considered relevant by the specific vendor. In addition, product offers might contain structured or semi-structured product specifications in the form of HTML tables and HTML lists. As product specifications often cover more product attributes than free-text descriptions, being able to extract attribute-value pairs from these specifications is a critical prerequisite for achieving good results in tasks such as product matching, product categorisation, faceted product search, and product recommendation.
In this paper, we present an approach for extracting attribute-value pairs from product specifications on the Web. We use supervised learning to classify the HTML tables and HTML lists within a web page as product specifications or not. To extract attribute-value pairs from the HTML fragments identified by the specification detector, we again use supervised learning to classify columns as attribute columns or value columns. Compared to DEXTER, the current state-of-the-art approach for extracting attribute-value pairs from product specifications, we introduce several new features for specification detection and support the extraction of attribute-value pairs from specifications having more than two columns. This allows us to improve the F-score by up to 10% for extracting attribute-value pairs from tables and by up to 3% for lists. In addition, we report the results of using duplicate-based schema matching to align the product attribute schemata of 32 different e-shops. This experiment confirms the suitability of duplicate-based schema matching for product data integration.
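To make the extraction step concrete, the following is a minimal sketch of the input and output involved. It is not the method of the paper: the supervised column classifier described above is replaced here by a fixed heuristic (in a two-column row, the first cell is taken as the attribute and the second as the value), and specification detection is skipped entirely. The names `TableRowCollector` and `extract_pairs` are hypothetical, and the sketch assumes table cells are explicitly closed.

```python
from html.parser import HTMLParser


class TableRowCollector(HTMLParser):
    """Collects the text of each table cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
            self._cell = None
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_pairs(html):
    """Return {attribute: value} for every two-column row in the HTML.

    Assumption (not from the paper): column 0 holds the attribute and
    column 1 holds the value. Rows with any other shape are skipped.
    """
    parser = TableRowCollector()
    parser.feed(html)
    parser.close()
    pairs = {}
    for row in parser.rows:
        if len(row) == 2 and row[0] and row[1]:
            pairs[row[0].rstrip(":")] = row[1]
    return pairs
```

A fixed positional rule like this cannot handle specifications with more than two columns, which is the case the learned column classifier in the paper is meant to cover.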

References

[1]
Lidong Bing, Tak-Lam Wong, and Wai Lam. 2016. Unsupervised Extraction of Popular Product Attributes from E-Commerce Web Sites by Considering Customer Reviews. ACM Trans. Internet Technol. 16, 2, Article 12 (April 2016), 17 pages.
[2]
Oren Etzioni, Rattapoom Tuchinda, Craig A. Knoblock, and Alexander Yates. 2003. To Buy or Not to Buy: Mining Airfare Data to Minimize Ticket Purchase Price. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '03). ACM, New York, NY, USA, 119--128.
[3]
Rayid Ghani, Katharina Probst, Yan Liu, Marko Krema, and Andrew Fano. 2006. Text mining for product attribute extraction. ACM SIGKDD Explorations Newsletter 8, 1 (2006), 41--48.
[4]
Vishrawas Gopalakrishnan, Suresh Parthasarathy Iyengar, Amit Madaan, Rajeev Rastogi, and Srinivasan Sengamedu. 2012. Matching product titles using web-based enrichment. In 21st ACM international conference on Information and knowledge management. 605--614.
[5]
Mihajlo Grbovic, Vladan Radosavljevic, Nemanja Djuric, Narayan Bhamidipati, Jaikit Savla, Varun Bhagwan, and Doug Sharp. 2015. E-commerce in Your Inbox: Product Recommendations at Scale. In Proceedings of the 21th ACM SIGKDD. ACM, 1809--1818.
[6]
Anitha Kannan, Inmar E Givoni, Rakesh Agrawal, and Ariel Fuxman. 2011. Matching unstructured product offers to structured product specifications. In 17th ACM SIGKDD.
[7]
M. H. Kiapour, X. Han, S. Lazebnik, A. C. Berg, and T. L. Berg. 2015. Where to Buy It: Matching Street Clothing Photos in Online Shops. In 2015 IEEE International Conference on Computer Vision (ICCV). 3343--3351.
[8]
Craig A. Knoblock, Kristina Lerman, Steven Minton, and Ion Muslea. 2003. Accurately and Reliably Extracting Data from the Web: A Machine Learning Approach. Physica-Verlag HD, Heidelberg, 275--287.
[9]
Hanna Köpcke, Andreas Thor, Stefan Thomas, and Erhard Rahm. 2012. Tailoring entity resolution for matching product offers. In Proceedings of the 15th International Conference on Extending Database Technology. ACM, 545--550.
[10]
Zornitsa Kozareva. 2015. Everyone Likes Shopping! Multi-class Product Categorization for e-Commerce. In The 2015 Annual Conference of the North American Chapter for the ACL. 1329--1333.
[11]
Oliver Lehmberg, Dominique Ritze, Petar Ristoski, Robert Meusel, Heiko Paulheim, and Christian Bizer. 2015. The Mannheim Search Join Engine. Web Semantics: Science, Services and Agents on the World Wide Web 35 (2015), 159--166. Semantic Web Challenge 2014.
[12]
Nikhil Londhe, Vishrawas Gopalakrishnan, Aidong Zhang, Hung Q Ngo, and Rohini Srihari. 2014. Matching titles with cross title web-search enrichment and community detection. Proceedings of the VLDB Endowment 7, 12 (2014), 1167--1178.
[13]
Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton van den Hengel. 2015. Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference. ACM, 43--52.
[14]
Gabor Melli. 2014. Shallow Semantic Parsing of Product Offering Titles (for better automatic hyperlink insertion). In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 1670--1678.
[15]
Robert Meusel, Petar Petrovski, and Christian Bizer. 2014. The WebDataCommons Microdata, RDFa and Microformat Dataset Series. In The Semantic Web-ISWC. 277--292.
[16]
Hoa Nguyen, Ariel Fuxman, Stelios Paparizos, Juliana Freire, and Rakesh Agrawal. 2011. Synthesizing products for online catalogs. Proceedings of the VLDB Endowment 4, 7 (2011), 409--418.
[17]
Stefano Ortona. 2014. An analysis of duplicate on web extracted objects. In Proceedings of the companion publication of the 23rd international conference on World wide web companion. 1279--1284.
[18]
Petar Petrovski, Volha Bryl, and Christian Bizer. 2014. Integrating product data from websites offering microdata markup. In Proceedings of the companion publication of the 23rd international conference on World wide web companion. 1299--1304.
[19]
Petar Petrovski, Volha Bryl, and Christian Bizer. 2014. Learning Regular Expressions for the Extraction of Product Attributes from E-commerce Microdata. (2014).
[20]
Petar Petrovski, Anna Primpeli, Robert Meusel, and Christian Bizer. 2017. The WDC Gold Standards for Product Feature Extraction and Product Matching. Springer International Publishing, Cham, 73--86.
[21]
Disheng Qiu, Luciano Barbosa, Xin Luna Dong, Yanyan Shen, and Divesh Srivastava. 2015. Dexter: large-scale discovery and extraction of product specifications on the web. Proceedings of the VLDB Endowment 8, 13 (2015), 2194--2205.
[22]
Daniel Rinser, Dustin Lange, and Felix Naumann. 2013. Cross-lingual Entity Matching and Infobox Alignment in Wikipedia. Inf. Syst. 38, 6 (Sept. 2013), 887--907.
[23]
Petar Ristoski and Peter Mika. 2016. Enriching Product Ads with Metadata from HTML Annotations. In Proceedings of the 13th Extended Semantic Web Conference. (To Appear).
[24]
Ronald van Bezu, Sjoerd Borst, Rick Rijkse, Jim Verhagen, Damir Vandic, and Flavius Frasincar. 2015. Multi-component Similarity Method for Web Product Duplicate Detection. (2015).
[25]
Damir Vandic, Jan-Willem Van Dam, and Flavius Frasincar. 2012. Faceted product search powered by the Semantic Web. Decision Support Systems 53, 3 (2012), 425--437.
[26]
Xi Wang, Zhenfeng Sun, Wenqiang Zhang, Yu Zhou, and Yu-Gang Jiang. 2016. Matching User Photos to Online Products with Robust Deep Features. In Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval (ICMR '16). ACM, New York, NY, USA, 7--14.
[27]
Tak-Lam Wong, Wai Lam, and Tik-Shun Wong. 2008. An Unsupervised Framework for Extracting and Normalizing Product Attributes from Multiple Web Sites. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 08). ACM, New York, NY, USA, 35--42.
[28]
W. X. Zhao, S. Li, Y. He, E. Chang, J. R. Wen, and X. Li. 2015. Connecting Social Media to E-Commerce: Cold-Start Product Recommendation On Microblogs. IEEE Transactions on Knowledge and Data Engineering PP, 99 (2015), 1--1.

    Published In

    WI '17: Proceedings of the International Conference on Web Intelligence
    August 2017
    1284 pages
    ISBN:9781450349512
    DOI:10.1145/3106426
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. feature extraction
    2. product data
    3. schema matching
    4. web tables

    Qualifiers

    • Research-article

    Conference

    WI '17

    Acceptance Rates

    WI '17 Paper Acceptance Rate 118 of 178 submissions, 66%;
    Overall Acceptance Rate 118 of 178 submissions, 66%

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months): 14
    • Downloads (Last 6 weeks): 2
    Reflects downloads up to 02 Mar 2025

    Citations

    Cited By

    • (2022) PAVE: Lazy-MDP based Ensemble to Improve Recall of Product Attribute Extraction Models. Proceedings of the 31st ACM International Conference on Information & Knowledge Management, 3233-3242. DOI: 10.1145/3511808.3557119. Online publication date: 17-Oct-2022
    • (2022) Extraction of Product Specifications from the Web - Going Beyond Tables and Lists. Proceedings of the 5th Joint International Conference on Data Science & Management of Data (9th ACM IKDD CODS and 27th COMAD), 19-27. DOI: 10.1145/3493700.3493713. Online publication date: 8-Jan-2022
    • (2022) MAVE. Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, 1256-1265. DOI: 10.1145/3488560.3498377. Online publication date: 11-Feb-2022
    • (2022) What Matters for Shoppers: Investigating Key Attributes for Online Product Comparison. Advances in Information Retrieval, 231-239. DOI: 10.1007/978-3-030-99739-7_27. Online publication date: 5-Apr-2022
    • (2021) Automatic Form Filling with Form-BERT. Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 1850-1854. DOI: 10.1145/3404835.3463063. Online publication date: 11-Jul-2021
    • (2021) DiffXtract: Joint Discriminative Product Attribute-Value Extraction. 2021 IEEE International Conference on Big Knowledge (ICBK), 271-280. DOI: 10.1109/ICKG52313.2021.00044. Online publication date: Dec-2021
    • (2020) Learning to Extract Attribute Value from Product via Question Answering: A Multi-task Approach. Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 47-55. DOI: 10.1145/3394486.3403047. Online publication date: 23-Aug-2020
    • (2019) The WDC Training Dataset and Gold Standard for Large-Scale Product Matching. Companion Proceedings of The 2019 World Wide Web Conference, 381-386. DOI: 10.1145/3308560.3316609. Online publication date: 13-May-2019
    • (2019) End-to-End Product Taxonomy Extension from Text Reviews. 2019 IEEE 13th International Conference on Semantic Computing (ICSC), 195-198. DOI: 10.1109/ICOSC.2019.8665533. Online publication date: Jan-2019
    • (2019) Accurate Product Attribute Extraction on the Field. 2019 IEEE 35th International Conference on Data Engineering (ICDE), 1862-1873. DOI: 10.1109/ICDE.2019.00202. Online publication date: Apr-2019
