ABSTRACT
In a document streaming environment, online detection of the first documents that mention previously unseen events is an open challenge. For this online new event detection (ONED) task, existing studies usually assume that enough resources are always available and focus entirely on detection accuracy without considering efficiency. Moreover, none of the existing work addresses the issue of providing an effective and friendly user interface. As a result, there is a significant gap between the existing systems and a system that can be used in practice. In this paper, we propose an ONED framework with the following prominent features. First, a combination of indexing and compression methods is used to improve the document processing rate by orders of magnitude without sacrificing much detection accuracy. Second, when resources are tight, a resource-adaptive computation method is used to maximize the benefit that can be gained from the limited resources. Third, when the new event arrival rate is beyond the processing capability of the consumer of the ONED system, new events are further filtered and prioritized before they are presented to the consumer. Fourth, implicit citation relationships are created among all the documents and used to compute the importance of document sources. This importance information can guide the selection of document sources. We implemented a prototype of our framework on top of IBM's Stream Processing Core middleware. We also evaluated the effectiveness of our techniques on the standard TDT5 benchmark. To the best of our knowledge, this is the first implementation of a real application in a large-scale stream processing system.
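To make the ONED task concrete, the following is a minimal baseline sketch, not the framework proposed in the paper. It flags a document as the first story of a new event when its highest cosine similarity to every previously seen document falls below a threshold. The whitespace tokenizer, the raw term-frequency weighting, the class name `NewEventDetector`, and the threshold value 0.3 are all assumptions made for illustration. Because each document is compared against every earlier one, the cost per document grows with the stream, which is the kind of inefficiency the indexing and compression methods mentioned above are meant to address.

```python
import math
from collections import Counter


def term_vector(text):
    """Return a unit-length sparse term-frequency vector for a document."""
    counts = Counter(text.lower().split())  # assumed tokenizer: whitespace split
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()} if norm else {}


def cosine(u, v):
    """Dot product of two unit-length sparse vectors (their cosine similarity)."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the smaller vector
    return sum(w * v.get(t, 0.0) for t, w in u.items())


class NewEventDetector:
    """Hypothetical baseline detector: brute-force comparison with all past documents."""

    def __init__(self, threshold=0.3):  # threshold value is an assumption
        self.threshold = threshold
        self.seen = []  # vectors of every document processed so far

    def process(self, text):
        """Return True if the document looks like the first story of a new event."""
        vec = term_vector(text)
        best = max((cosine(vec, old) for old in self.seen), default=0.0)
        self.seen.append(vec)
        return best < self.threshold


detector = NewEventDetector(threshold=0.3)
first = detector.process("earthquake strikes coastal city")
follow_up = detector.process("earthquake strikes coastal city again")
unrelated = detector.process("election results announced tonight")
```

Here `first` and `unrelated` are flagged as new events, while `follow_up` is not, since it shares almost all of its terms with the first document.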
CITED BY
Kun-Lung Wu, Kirsten W. Hildrum, Wei Fan, Philip S. Yu, Charu C. Aggarwal, David A. George, Buğra Gedik, Eric Bouillet, Xiaohui Gu, Gang Luo, Haixun Wang, Challenges and experience in prototyping a multi-modal stream analytic and monitoring application on System S, Proceedings of the 33rd international conference on Very large data bases, September 23-27, 2007, Vienna, Austria