Understanding temporal aspects in document classification
Source
Web Search and Web Data Mining
Proceedings of the international conference on Web search and web data mining
Palo Alto, California, USA
SESSION: Classification
Pages 159-170  
Year of Publication: 2008
ISBN: 978-1-59593-927-9
Authors
Fernando Mourão  Federal University of Minas Gerais, Belo Horizonte, Brazil
Leonardo Rocha  Federal University of Minas Gerais, Belo Horizonte, Brazil
Renata Araújo  Federal University of Minas Gerais, Belo Horizonte, Brazil
Thierson Couto  Federal University of Minas Gerais, Belo Horizonte, Brazil
Marcos Gonçalves  Federal University of Minas Gerais, Belo Horizonte, Brazil
Wagner Meira, Jr.  Federal University of Minas Gerais, Belo Horizonte, Brazil
Sponsors
ACM: Association for Computing Machinery
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
SIGMOD: ACM Special Interest Group on Management of Data
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
SIGIR: ACM Special Interest Group on Information Retrieval
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1341531.1341554

ABSTRACT

Due to the increasing amount of information on the Web, Automatic Document Classification (ADC) has become an important research topic. ADC usually follows a standard supervised learning strategy, where we first build a model using preclassified documents and then use it to classify new, unseen documents. One major challenge for ADC in many scenarios is that the characteristics of the documents and the classes to which they belong may change over time. However, most current ADC techniques are applied without taking into account the temporal evolution of the document collection.

In this work, we perform a detailed study of temporal evolution in ADC and introduce an analysis methodology. We argue that temporal evolution may be explained by three factors: 1) class distribution; 2) term distribution; and 3) class similarity. We employ metrics and experimental strategies capable of isolating each of these factors in order to analyze them separately, using two very different document collections: the ACM Digital Library and the Medline medical collection. Moreover, we present preliminary results on the potential gains that could be obtained by varying the training set to find the ideal size that minimizes the time effects. We show that by using just 69% of the ACM database we achieve an accuracy of 89.76%, and with only 25% of Medline an accuracy of 87.57%, which means gains of up to 20% in accuracy with much smaller training sets.
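
The first factor named above, class distribution, can be illustrated with a small sketch. The abstract does not say which metrics the authors use, so everything here is an assumption made for illustration: the function names, the toy documents, and the choice of total variation distance to compare two years.

```python
from collections import Counter, defaultdict


def class_distribution_by_year(docs):
    """Map each year to the fraction of its documents in each class.

    `docs` is an iterable of (year, class_label) pairs (a hypothetical format).
    """
    counts = defaultdict(Counter)
    for year, label in docs:
        counts[year][label] += 1
    return {
        year: {label: n / sum(c.values()) for label, n in c.items()}
        for year, c in counts.items()
    }


def total_variation(p, q):
    """Total variation distance between two class distributions (0 = identical)."""
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(l, 0.0) - q.get(l, 0.0)) for l in labels)


# Toy collection (invented data): class shares shift between 2000 and 2005.
docs = [
    (2000, "databases"), (2000, "databases"), (2000, "networks"), (2000, "networks"),
    (2005, "databases"), (2005, "networks"), (2005, "networks"), (2005, "networks"),
]

dist = class_distribution_by_year(docs)
drift = total_variation(dist[2000], dist[2005])
print(drift)
```

A larger distance between a training year and the test year suggests that documents from that year reflect a different class balance, which is one reason a smaller, more recent training set might classify better than the full collection.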

