ABSTRACT
Web pages often contain clutter (such as pop-up ads, unnecessary images and extraneous links) around the body of an article that distracts a user from the actual content. Extraction of "useful and relevant" content from web pages has many applications, including cell phone and PDA browsing, speech rendering for the visually impaired, and text summarization. Most approaches to removing clutter or making content more readable involve changing font size or removing HTML and data components such as images, which takes away from a webpage's inherent look and feel. Unlike "Content Reformatting", which aims to reproduce the entire webpage in a more convenient form, our solution directly addresses "Content Extraction". We have developed a framework that employs an easily extensible set of techniques that incorporate the advantages of previous work on content extraction. Our key insight is to work with DOM trees, rather than with raw HTML markup. We have implemented our approach in a publicly available Web proxy to extract content from HTML web pages.
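To illustrate what working with a DOM tree rather than raw markup can look like, the following is a minimal sketch and not the proxy implementation described above. It assumes a simplified tree built with Python's standard `html.parser`; the names `Node`, `TreeBuilder`, `prune` and `extract_content`, the list of clutter tags, and the 0.5 link-text threshold are all hypothetical choices made for this example.

```python
from html.parser import HTMLParser

# Assumed for illustration: which tags count as clutter, and which
# containers are checked for being mostly links.
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}
DROP_TAGS = {"script", "style", "iframe", "img"}
CONTAINER_TAGS = {"ul", "ol", "table", "div"}
LINK_RATIO_THRESHOLD = 0.5


class Node:
    """A very small stand-in for a DOM element."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []  # holds Node objects and plain text strings


class TreeBuilder(HTMLParser):
    """Builds a simplified element tree from HTML markup."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, if there is one.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(data.strip())


def text_len(node, links_only=False, in_link=False):
    """Length of text under node; with links_only, count only text inside <a>."""
    total = 0
    for child in node.children:
        if isinstance(child, str):
            if in_link or not links_only:
                total += len(child)
        else:
            total += text_len(child, links_only, in_link or child.tag == "a")
    return total


def prune(node):
    """Remove clutter subtrees in place: dropped tags and link-heavy containers."""
    kept = []
    for child in node.children:
        if isinstance(child, str):
            kept.append(child)
            continue
        if child.tag in DROP_TAGS:
            continue
        total = text_len(child)
        if (child.tag in CONTAINER_TAGS and total > 0
                and text_len(child, links_only=True) / total > LINK_RATIO_THRESHOLD):
            continue
        prune(child)
        kept.append(child)
    node.children = kept


def collect_text(node, out):
    for child in node.children:
        if isinstance(child, str):
            out.append(child)
        else:
            collect_text(child, out)


def extract_content(html):
    """Parse markup into a tree, prune clutter nodes, return remaining text."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    prune(builder.root)
    parts = []
    collect_text(builder.root, parts)
    return " ".join(parts)
```

Because each filter operates on whole subtrees, a navigation list or an advertisement block can be removed as a unit, which is awkward to do reliably with pattern matching on raw markup. Further filters could be added as extra conditions in `prune`.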