|
ABSTRACT
Multidimensional statistical models are generally computed outside a relational DBMS, exporting data sets. This article explains how fundamental multidimensional statistical models are computed inside the DBMS in a single table scan exploiting SQL and User-Defined Functions (UDFs). The techniques described herein are used in a commercial data mining tool, called Teradata Warehouse Miner. Specifically, we explain how correlation, linear regression, PCA and clustering, are integrated into the Teradata DBMS. Two major database processing tasks are discussed: building a model and scoring a data set based on a model. To build a model two summary matrices are shown to be common and essential for all linear models: the linear sum of points and the quadratic sum of cross-products of points. Since such matrices are generally significantly smaller than the data set, we explain how the remaining matrix operations to build the model can be quickly performed outside the DBMS. We first explain how to efficiently compute summary matrices with plain SQL queries. Then we present two sets of UDFs that work in a single table scan: an aggregate UDF to compute summary matrices and a set of scalar UDFs to score data sets. Experiments compare UDFs and SQL queries (running inside the DBMS) with C++ (running outside on exported files). In general, UDFs are faster than SQL queries and UDFs are more efficient than C++, due to long export times. Statistical models based on the summary matrices can be built outside the DBMS in just a few seconds. Aggregate and scalar UDFs scale linearly and require only one table scan, making them ideal to process large data sets.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
Rakesh Agrawal , Johannes Gehrke , Dimitrios Gunopulos , Prabhakar Raghavan, Automatic subspace clustering of high dimensional data for data mining applications, Proceedings of the 1998 ACM SIGMOD international conference on Management of data, p.94-105, June 01-04, 1998, Seattle, Washington, United States
|
 |
2
|
Rakesh Agrawal , Tomasz Imieliński , Arun Swami, Mining association rules between sets of items in large databases, Proceedings of the 1993 ACM SIGMOD international conference on Management of data, p.207-216, May 25-28, 1993, Washington, D.C., United States
|
| |
3
|
P. Bradley, U. Fayyad, and C. Reina. Scaling clustering algorithms to large databases. In ACM KDD Conference, pages 9--15, 1998.
|
 |
4
|
|
 |
5
|
John Clear , Debbie Dunn , Brad Harvey , Michael Heytens , Peter Lohman , Abhay Mehta , Mark Melton , Lars Rohrberg , Ashok Savasere , Robert Wehrmeister , Melody Xu, NonStop SQL/MX primitives for knowledge discovery, Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, p.425-429, August 15-18, 1999, San Diego, California, United States
[doi> 10.1145/312129.312309]
|
| |
6
|
R. Elmasri and S. B. Navathe. Fundamentals of Database Systems. Addison/Wesley, Redwood City, California, 3rd edition, 2000.
|
 |
7
|
Johannes Gehrke , Venkatesh Ganti , Raghu Ramakrishnan , Wei-Yin Loh, BOAT—optimistic decision tree construction, Proceedings of the 1999 ACM SIGMOD international conference on Management of data, p.169-180, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States
|
| |
8
|
A. Ghazal, A. Crolotte, and R. Bhashyam. Outer join elimination in the Teradata RDBMS. In DEXA Conference, pages 730--740, 2004.
|
| |
9
|
G. Graefe, U. Fayyad, and S. Chaudhuri. On the efficient gathering of sufficient statistics for classification from large SQL databases. In ACM KDD Conference, pages 204--208, 1998.
|
| |
10
|
T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning. Springer, New York, 1st edition, 2001.
|
| |
11
|
A. Hinneburg, D. Habich, and W. Lehner. Combi-operator-database support for data mining applications. In VLDB Conference, pages 429--439, 2003.
|
 |
12
|
|
| |
13
|
|
| |
14
|
A. Netz, S. Chaudhuri, U. Fayyad, and J. Berhardt. Integrating data mining with SQL databases: OLE DB for data mining. In IEEE ICDE Conference, 2001.
|
| |
15
|
|
 |
16
|
|
| |
17
|
C. Ordonez and J. García-García. Vector and matrix operations programmed with UDFs in a relational DBMS. In ACM CIKM Conference, 2006.
|
| |
18
|
|
 |
19
|
Sunita Sarawagi , Shiby Thomas , Rakesh Agrawal, Integrating association rule mining with relational database systems: alternatives and implications, Proceedings of the 1998 ACM SIGMOD international conference on Management of data, p.343-354, June 01-04, 1998, Seattle, Washington, United States
|
 |
20
|
|
 |
21
|
Andrew Witkowski , Srikanth Bellamkonda , Tolga Bozkaya , Gregory Dorman , Nathan Folkert , Abhinav Gupta , Lei Shen , Sankar Subramanian, Spreadsheets in RDBMS for OLAP, Proceedings of the 2003 ACM SIGMOD international conference on Management of data, June 09-12, 2003, San Diego, California
[doi> 10.1145/872757.872767]
|
 |
22
|
Tian Zhang , Raghu Ramakrishnan , Miron Livny, BIRCH: an efficient data clustering method for very large databases, Proceedings of the 1996 ACM SIGMOD international conference on Management of data, p.103-114, June 04-06, 1996, Montreal, Quebec, Canada
|
|