ACM Home Page
Please provide us with feedback. Feedback
Structure choices for two-dimensional histogram construction
Full text PdfPdf (340 KB)
Source IBM Centre for Advanced Studies Conference archive
Proceedings of the 2004 conference of the Centre for Advanced Studies on Collaborative research table of contents
Markham, Ontario, Canada
Pages: 13 - 27  
Year of Publication: 2004
ISSN:1705-7345
Authors
Hang T. A. Pham  Department of Computer Science, University of Toronto
Kenneth C. Sevcik  Department of Computer Science, University of Toronto
Sponsors
NRC : National Research Council - Canada
: IBM Toronto Laboratory
: IBM Centre for Advanced Studies (CAS)
Publisher
IBM Press 
Bibliometrics
Downloads (6 Weeks): 2,   Downloads (12 Months): 30,   Citation Count: 0
Additional Information:

abstract   references   index terms   collaborative colleagues  

Tools and Actions: Review this Article  
Save this Article to a Binder    Display Formats: BibTex  EndNote ACM Ref   

ABSTRACT

Histograms of the distributions of individual attributes are currently used in leading database management systems (e.g., IBM DB2, Oracle Database, and Microsoft SQL server). Because attribute pairs in databases are seldom independent, however, the use of the distributions of individual attributes with the attribute independence assumption often leads to poor estimates. More accurate answers can be obtained by using multi-dimensional histograms to characterize the joint distribution of two or more attributes.

When moving from one-dimensional to two-dimensional histograms, several new issues relating to histogram structure arise: (1) Which attribute should take priority over the other with respect to data partitioning?; (2) Into how many partitions should each dimension be split to obtain a desired number of histogram buckets?; and (3) How many most frequent values should be isolated and stored in singleton buckets?

In the context of real data, we experimentally show that our proposed methods for dealing with histogram structure choices lead to good quality histograms for a variety of histogram partitioning techniques and various types of data distributions.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

1
2
 
3
{3} D. Barbara, W. DuMouchel, C. Faloutsos, P.J. Haas, J.M. Hellerstein, Y. Ioannidis, H.V. Jagadish, T. Johnson, R. Ng, V. Poosala, K.A. Ross, K.C. Sevcik. The New Jersey data reduction report. Data Engineering Bulletin, vol. 20, no. 4, pages 3-45, 1997.
4
 
5
{5} Census database repository, U. S. Census Bureau. Available at http://www.census.gov.
 
6
7
 
8
9
10
 
11
 
12
 
13
14
15
16
 
17
 
18
{18} Hang T. A. Pham. Accurate two-dimensional histograms for fast approximate answers to queries on real data. M.S. thesis. University of Toronto, March 2004. Available at http://www.cs.toronto.edu/~hangp/thesis.html
19
 
20
{20} V. Poosala, V. Ganti, Y. Ioannidis. Approximate Query answering using histograms. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, vol. 22, no. 4, pages 5-14, 1999.
21
 
22
 
23
 
24
{24} UCI KDD database repository. Available at http://kdd.ics.uci.edu.
 
25
 
26

INDEX TERMS

Primary Classification:
  H. Information Systems
  H.2 DATABASE MANAGEMENT
      H.2.1 Logical Design
          Subjects: Data models

Additional Classification:
  H. Information Systems
  H.2 DATABASE MANAGEMENT
      H.2.1 Logical Design
          Subjects: Schema and subschema
      H.2.3 Languages

          Nouns: DB2
      H.2.4 Systems

          Nouns: Microsoft SQL Server


General Terms:
Algorithms, Performance

Collaborative Colleagues:
Hang T. A. Pham: colleagues
Kenneth C. Sevcik: colleagues