No abstract available.
Salton Award lecture: on theoretical argument in information retrieval (summary only): on theoretical argument in information retrieval
The last winner of the Salton Award, Tefko Saracevic, gave an acceptance address at SIGIR in Philadelphia in 1997. Previous winners were William Cooper (1994), Cyril Cleverdon (1991), Karen Sparck Jones (1988) and Gerard Salton himself (1985).
In this ...
Relevance and contributing information types of searched documents in task performance
End-users base the relevance judgements of the searched documents on the expected contribution to their task of the information contained in the documents. There is a shortage of studies analyzing the relationships between the experienced contribution, ...
Relevance feedback with a small number of relevance judgements: incremental relevance feedback vs. document clustering
The use of incremental relevance feedback and document clustering were investigated in an relevance feedback environment in which the number of relevance judgements was quite small. Through experiments on the TREC collection, the incremental relevance ...
Do batch and user evaluations give the same results?
Do improvements in system performance demonstrated by batch evaluations confer the same benefit for real users? We carried out experiments designed to investigate this question. After identifying a weighting scheme that gave maximum improvement over the ...
A novel method for the evaluation of Boolean query effectiveness across a wide operational range
Traditional methods for the system-oriented evaluation of Boolean IR system suffer from validity and reliability problems. Laboratory-based research neglects the searcher and studies suboptimal queries. Research on operational systems fails to make a ...
Evaluating evaluation measure stability
This paper presents a novel way of examining the accuracy of the evaluation measures commonly used in information retrieval experiments. It validates several of the rules-of-thumb experimenters use, such as the number of queries needed for a good ...
IR evaluation methods for retrieving highly relevant documents
This paper proposes evaluation methods based on the use of non-dichotomous relevance judgements in IR experiments. It is argued that evaluation methods should credit IR methods for their ability to retrieve highly relevant documents. This is desirable ...
Automatic generation of overview timelines
We present a statistical model of feature occurrence over time, and develop tests based on classical hypothesis testing for significance of term appearance on a given date. Using additional classical hypothesis testing we are able to combine these terms ...
Event tracking based on domain dependency
This paper proposes a method for event tracking on broadcast news stories based on distinction between a topic and an event. A topic and an event are identified using a simple criterion called domain dependency of words: how greatly a word features a ...
Improving text categorization methods for event tracking
Automated tracking of events from chronologically ordered document streams is a new challenge for statistical text classification. Existing learning techniques must be adapted or improved in order to effectively handle difficult situations where the ...
Evaluation of a simple and effective music information retrieval method
We developed, and then evaluated, a music information retrieval (MIR) system based upon the intervals found within the melodies of a collection of 9354 folksongs. The songs were converted to an interval-only representation of monophonic melodies and ...
Phonetic confusion matrix based spoken document retrieval
Combined word-based index and phonetic indexes have been used to improve the performance of spoken document retrieval systems primarily by addressing the out-of-vocabulary retrieval problem. However, a known problem with phonetic recognition is its ...
Multiple evidence combination in image retrieval: Diogenes searches for people on the Web
In this work, we examine evidence combination mechanisms for classifying multimedia information. In particular, we examine linear and Dempster-Shafer methods of evidence combination in the context of identifying personal images on the World Wide Web. An ...
Link-based and content-based evidential information in a belief network model
This work presents an information retrieval model developed to deal with hyperlinked environments. The model is based on belief networks and provides a framework for combining information extracted from the content of the documents with information ...
The feature quantity: an information theoretic perspective of Tfidf-like measures
The feature quantity, a quantitative representation of specificity introduced in this paper, is based on an information theoretic perspective of co-occurrence events between terms and documents. Mathematically, the feature quantity is defined as a ...
INSYDER — an information assistant for business intelligence
The WWW is the most important resource for external business information. This paper presents a tool called INSYDER, an information assistant for finding and analysis business information from the WWW. INSYDER is a system using different agents for ...
Structured translation for cross-language information retrieval
The paper introduces a query translation model that reflects the structure of the cross-language information retrieval task. The model is based on a structured bilingual dictionary in which the translations of each term are clustered into groups with ...
Automatic adaptation of proper noun dictionaries through cooperation of machine learning and probabilistic methods
- Georgios Petasis,
- Alessandro Cucchiarelli,
- Paola Velardi,
- Georgios Paliouras,
- Vangelis Karkaletsis,
- Constantine D. Spyropoulos
The recognition of Proper Nouns (PNs) is considered an important task in the area of Information Retrieval and Extraction. However the high performance of most existing PN classifiers heavily depends upon the availability of large dictionaries of domain-...
Document centered approach to text normalization
In this paper we present an approach to tackle three important problems of text normalization: sentence boundary disambiguation, disambiguation of capitalized words when they are used in positions where capitalization is expected, and identification of ...
OCELOT: a system for summarizing Web pages
We introduce OCELOT, a prototype system for automatically generating the “gist” of a web page by summarizing it. Although most text summarization research to date has focused on the task of news articles, web pages are quite different in both structure ...
Extracting sentence segments for text summarization: a machine learning approach
With the proliferation of the Internet and the huge amount of data it transfers, text summarization is becoming more important. We present an approach to the design of an automatic text summarizer that generates a summary by extracting sentence ...
An experimental comparison of naive Bayesian and keyword-based anti-spam filtering with personal e-mail messages
The growing problem of unsolicited bulk e-mail, also known as “spam”, has generated a need for reliable anti-spam e-mail filters. Filters of this type have so far been based mostly on manually constructed keyword patterns. An alternative approach has ...
Text filtering by boosting naive Bayes classifiers
Several machine learning algorithms have recently been used for text categorization and filtering. In particular, boosting methods such as AdaBoost have shown good performance applied to real text data. However, most of existing boosting algorithms are ...
Document filtering method using non-relevant information profile
Document filtering is a task to retrieve documents relevant to a user's profile from a flow of documents. Generally, filtering systems calculate the similarity between the profile and each incoming document, and retrieve documents with similarity higher ...
Question-answering by predictive annotation
We present a new technique for question answering called Predictive Annotation. Predictive Annotation identifies potential answers to questions in text, annotates them accordingly and indexes them. This technique, along with a complementary analysis of ...
Bridging the lexical chasm: statistical approaches to answer-finding
This paper investigates whether a machine can automatically learn the task of finding, within a large collection of candidate responses, the answers to questions. The learning process consists of inspecting a collection of answered questions and ...
Building a question answering test collection
The TREC-8 Question Answering (QA) Track was the first large-scale evaluation of domain-independent question answering systems. In addition to fostering research on the QA task, the track was used to investigate whether the evaluation methodology used ...
Document clustering using word clusters via the information bottleneck method
We present a novel implementation of the recently introduced information bottleneck method for unsupervised document clustering. Given a joint empirical distribution of words and documents, p(x, y), we first cluster the words, Y, so that the obtained ...
Latent semantic space: iterative scaling improves precision of inter-document similarity measurement
We present a novel algorithm that creates document vectors with reduced dimensionality. This work was motivated by an application characterizing relationships among documents in a collection. Our algorithm yielded inter-document similarities with an ...
An investigation of linguistic features and clustering algorithms for topical document clustering
We investigate four hierarchical clustering methods (single-link, complete-link, groupwise-average, and single-pass) and two linguistically motivated text features (noun phrase heads and proper names) in the context of document clustering. A statistical ...
Index Terms
- Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval