DOI: 10.1145/1076034.1076079
Article

Title extraction from bodies of HTML documents and its application to web page retrieval

Published: 15 August 2005

ABSTRACT

This paper is concerned with the automatic extraction of titles from the bodies of HTML documents. Titles of HTML documents should be correctly defined in the title fields; in reality, however, HTML titles are often bogus. It is therefore desirable to extract titles automatically from the bodies of HTML documents, an issue that does not seem to have been investigated previously. In this paper, we take a supervised machine learning approach to the problem. We propose a specification for HTML titles. We use format information such as font size, position, and font weight as features in title extraction. Our method significantly outperforms the baseline method of using the lines in the largest font size as the title (20.9%-32.6% improvement in F1 score). As an application, we consider web page retrieval, using the TREC Web Track data for evaluation. We propose a new method for HTML document retrieval using extracted titles. Experimental results indicate that using both extracted titles and title fields is almost always better than using title fields alone; extracted titles are particularly helpful in the task of named page finding (23.1%-29.0% improvement).
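
The baseline named in the abstract (take the lines in the largest font size as the title) and the F1 comparison can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration: the `Line` record and its fields, the helper names `largest_font_baseline` and `f1_score`, the sample page, and the choice of token-level F1. The abstract does not state how the paper represents lines or at what granularity it computes F1, and this is not the paper's learned extraction model.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Line:
    """Hypothetical record for one rendered line of an HTML body."""
    text: str
    font_size: float  # assumed rendered font size, in points
    position: int     # assumed order of the line within the page body
    bold: bool        # assumed font-weight flag (unused by the baseline)


def largest_font_baseline(lines):
    """Join, in page order, the text of every line set in the largest font size."""
    if not lines:
        return ""
    max_size = max(line.font_size for line in lines)
    picked = sorted(
        (line for line in lines if line.font_size == max_size),
        key=lambda line: line.position,
    )
    return " ".join(line.text for line in picked)


def f1_score(extracted, reference):
    """Token-level F1 between two titles (the granularity is an assumption)."""
    ext = Counter(extracted.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((ext & ref).values())  # multiset intersection of tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(ext.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Invented page: the site banner uses a larger font than the real title.
page = [
    Line("ACME Corp", 24.0, 0, True),
    Line("Annual Report 2004", 18.0, 1, True),
    Line("Welcome to our site", 12.0, 2, False),
]

baseline_title = largest_font_baseline(page)
print(baseline_title)
print(f1_score(baseline_title, "Annual Report 2004"))
```

On this invented page the baseline returns the banner instead of the title, which is the kind of failure that motivates combining several format features (font size, position, font weight) in a learned model rather than relying on font size alone.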


                  Published in

                  SIGIR '05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
                  August 2005
                  708 pages
                  ISBN: 1595930345
                  DOI: 10.1145/1076034

                  Copyright © 2005 ACM

                  Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

                  Publisher

                  Association for Computing Machinery

                  New York, NY, United States


                  Acceptance Rates

                  Overall Acceptance Rate: 792 of 3,983 submissions, 20%
