Modeling and managing changes in text databases
Source
ACM Transactions on Database Systems (TODS)
Volume 32, Issue 3 (August 2007)
Article No. 14
Year of Publication: 2007
ISSN:0362-5915
Authors
Panagiotis G. Ipeirotis  New York University, New York, NY
Alexandros Ntoulas  Microsoft Search Labs, Mountain View, CA
Junghoo Cho  University of California, Los Angeles, CA
Luis Gravano  Columbia University, New York, NY
Publisher
ACM  New York, NY, USA
DOI
http://doi.acm.org/10.1145/1272743.1272744

ABSTRACT

Large amounts of (often valuable) information are stored in web-accessible text databases. “Metasearchers” provide unified interfaces to query multiple such databases at once. For efficiency, metasearchers rely on succinct statistical summaries of the database contents to select the best databases for each query. So far, database selection research has largely assumed that databases are static, so the associated statistical summaries do not evolve over time. However, databases are rarely static and the statistical summaries that describe their contents need to be updated periodically to reflect content changes. In this article, we first report the results of a study showing how the content summaries of 152 real web databases evolved over a period of 52 weeks. Then, we show how to use “survival analysis” techniques in general, and Cox's proportional hazards regression in particular, to model database changes over time and predict when we should update each content summary. Finally, we exploit our change model to devise update schedules that keep the summaries up to date by contacting databases only when needed, and then we evaluate the quality of our schedules experimentally over real web databases.
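
As a rough illustration of how a proportional hazards model can drive an update schedule, the sketch below computes when the probability that a content summary is still up to date falls below a chosen threshold. It is a minimal sketch under stated assumptions, not the method from the article: a Cox model leaves the baseline hazard unspecified, whereas this sketch assumes a constant baseline hazard for simplicity, and the function names, the single covariate, and all numeric values are hypothetical.

```python
import math


def relative_risk(coefficients, covariates):
    """exp(beta . x): how much faster than the baseline this database changes."""
    return math.exp(sum(b * x for b, x in zip(coefficients, covariates)))


def survival_probability(t_weeks, baseline_hazard, coefficients, covariates):
    """Probability that a summary is still up to date after t_weeks.

    Proportional hazards form: S(t | x) = S0(t) ** exp(beta . x).
    ASSUMPTION: constant baseline hazard, so S0(t) = exp(-baseline_hazard * t).
    """
    risk = relative_risk(coefficients, covariates)
    return math.exp(-baseline_hazard * t_weeks * risk)


def weeks_until_update(threshold, baseline_hazard, coefficients, covariates):
    """Time at which S(t | x) drops to `threshold`; update the summary then."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must be strictly between 0 and 1")
    risk = relative_risk(coefficients, covariates)
    return -math.log(threshold) / (baseline_hazard * risk)


# Hypothetical parameters, chosen only for illustration.
baseline_hazard = 0.05   # assumed weekly baseline hazard
coefficients = [0.8]     # assumed weight of one made-up covariate
slow_db = [0.0]          # covariate value for a slowly changing database
fast_db = [1.5]          # covariate value for a quickly changing database

slow_weeks = weeks_until_update(0.5, baseline_hazard, coefficients, slow_db)
fast_weeks = weeks_until_update(0.5, baseline_hazard, coefficients, fast_db)
```

With these made-up inputs, the database with the larger covariate value gets a shorter interval between updates, which is the behavior a change-aware schedule is after: databases are contacted only when their summaries are likely to be stale.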


