Extrapolation errors in linear model trees

Published: 01 August 2007

Abstract

Prediction errors from a linear model tend to be larger when extrapolation is involved, particularly when the model is wrong. This article considers the problem of extrapolation and interpolation errors when a linear model tree is used for prediction. It proposes several ways to curtail the size of the errors, and uses a large collection of real datasets to demonstrate that the solutions are effective in reducing the average mean squared prediction error. The article also provides a proof that, if a linear model is correct, the proposed solutions have no undesirable effects as the training sample size tends to infinity.
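The abstract does not spell out the proposed remedies, so the following is only a minimal sketch of the general idea under stated assumptions. It fits a single least-squares line (standing in for one leaf of a linear model tree) to data from a curve that is not linear, then predicts well outside the training range. As one illustrative way to curtail extrapolation error, the prediction is clipped to the range of the training responses. The clipping rule, the function names `fit_linear` and `predict_truncated`, and the simulated data are all assumptions made for illustration and are not taken from the article.

```python
import numpy as np

def fit_linear(x, y):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[0], coef[1]

def predict_truncated(a, b, x_new, y_train):
    """Linear prediction clipped to the observed training response range.

    Hypothetical remedy used here for illustration only.
    """
    raw = a + b * x_new
    return np.clip(raw, y_train.min(), y_train.max())

# Simulated data: the true curve saturates, so a straight line is the wrong model.
rng = np.random.default_rng(0)
x_train = rng.uniform(0.0, 1.0, 200)
y_train = 1.0 - np.exp(-3.0 * x_train) + rng.normal(0.0, 0.05, 200)
a, b = fit_linear(x_train, y_train)

# Test points lie far outside the training range [0, 1]: pure extrapolation.
x_test = np.linspace(2.0, 4.0, 50)
y_true = 1.0 - np.exp(-3.0 * x_test)

pred_raw = a + b * x_test
pred_trunc = predict_truncated(a, b, x_test, y_train)

mse_raw = np.mean((pred_raw - y_true) ** 2)
mse_trunc = np.mean((pred_trunc - y_true) ** 2)
print(f"extrapolation MSE, raw line:  {mse_raw:.4f}")
print(f"extrapolation MSE, truncated: {mse_trunc:.4f}")
```

In this simulation the unclipped line keeps rising past the region where the true curve has levelled off, so its squared error grows with distance from the training data, while the clipped prediction stays within values that were actually observed. When the linear model is correct and the test points fall inside the training range, clipping of this kind rarely changes the prediction.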

Published In

ACM Transactions on Knowledge Discovery from Data, Volume 1, Issue 2
August 2007
89 pages
ISSN:1556-4681
EISSN:1556-472X
DOI:10.1145/1267066
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Decision tree
  2. prediction
  3. regression
  4. statistics
