poster

Web news categorization using a cross-media document graph

Authors:
José Iria, The University of Sheffield, Sheffield, UK
Fabio Ciravegna, The University of Sheffield, Sheffield, UK
João Magalhães, Instituto Superior de Engenharia de Lisboa, Lisbon, Portugal

CIVR '09: Proceedings of the ACM International Conference on Image and Video Retrieval, July 2009, Article No.: 27, Pages 1–8, https://doi.org/10.1145/1646396.1646431

Published: 08 July 2009

ABSTRACT

In this paper we propose a multimedia categorization framework that exploits information across the different parts of a multimedia document (e.g., a Web page, a PDF, a Microsoft Office document). For example, a Web news page is composed of text describing some event (e.g., a car accident) and a picture that either conveys additional information about the real extent of the event (e.g., how damaged the car is) or provides evidence corroborating the text. The framework handles multimedia information by considering not only the document's text and image data but also the layout structure, which determines how a given text block is related to a particular image. The novelties and contributions of the proposed framework are: (1) support for heterogeneous types of multimedia documents; (2) a document-graph representation method; and (3) the computation of cross-media correlations. We applied the framework to the task of categorising Web news feed data, and our results show a significant improvement over a single-medium framework.
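The abstract does not specify how the document graph or the cross-media correlations are computed, so the following is only a minimal sketch of the general idea under stated assumptions: nodes are text blocks and images, edges are layout links between them, and a "cross-media correlation" is approximated here by a Pearson coefficient between one text feature and one image feature over all linked text-image pairs. The class names, the feature names (`crash_count`, `red_ratio`), and the choice of Pearson correlation are hypothetical illustrations, not the authors' method.

```python
from dataclasses import dataclass, field
from math import sqrt


@dataclass
class Node:
    node_id: str
    medium: str      # "text" or "image"
    features: dict   # feature name -> numeric value (hypothetical features)


@dataclass
class DocumentGraph:
    """One document: content nodes plus layout edges between them."""
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (node_id, node_id) pairs

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def link(self, a, b):
        # Record that the layout relates node a to node b.
        self.edges.append((a, b))

    def cross_media_pairs(self):
        """Return (text_node, image_node) pairs joined by a layout edge."""
        pairs = []
        for a, b in self.edges:
            na, nb = self.nodes[a], self.nodes[b]
            if {na.medium, nb.medium} == {"text", "image"}:
                pairs.append((na, nb) if na.medium == "text" else (nb, na))
        return pairs


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def cross_media_correlation(graphs, text_feature, image_feature):
    """Correlate a text feature with an image feature over linked pairs."""
    xs, ys = [], []
    for graph in graphs:
        for text, image in graph.cross_media_pairs():
            xs.append(text.features.get(text_feature, 0.0))
            ys.append(image.features.get(image_feature, 0.0))
    return pearson(xs, ys) if xs else 0.0


def make_news_page(crash_count, red_ratio):
    """Build a toy news page: one text block linked to one image."""
    graph = DocumentGraph()
    graph.add_node(Node("t1", "text", {"crash_count": crash_count}))
    graph.add_node(Node("i1", "image", {"red_ratio": red_ratio}))
    graph.link("t1", "i1")
    return graph


if __name__ == "__main__":
    corpus = [make_news_page(0, 0.1), make_news_page(1, 0.2), make_news_page(2, 0.3)]
    print(cross_media_correlation(corpus, "crash_count", "red_ratio"))
```

Only edges that join a text node to an image node contribute to the statistic, which is one simple way to let the layout structure decide which text block is paired with which image.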

Index Terms

Web news categorization using a cross-media document graph
1. Information systems
  1. Information retrieval
    1. Document representation
  2. Information storage systems
    1. Record storage systems
      1. Record storage alternatives

Published in
CIVR '09: Proceedings of the ACM International Conference on Image and Video Retrieval
July 2009
383 pages
ISBN:9781605584805
DOI:10.1145/1646396
Conference Chairs:
Yiannis Kompatsiaris, CERTH-ITI, Greece
Stephane Marchand-Maillet, Univ. of Geneva, Switzerland
Program Chairs:
Yannis Avrithis, NTUA, Greece
Noel O'Connor, DCU, Ireland
Daniel Gatica-Perez, Idiap Research Institute, Switzerland
Tat-Seng Chua, National University of Singapore, Singapore
Copyright © 2009 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
cross-media correlations
cross-media documents
document-graph
web news categorization
Qualifiers
- poster