skip to main content
10.1145/1774088.1774304acmconferencesArticle/Chapter ViewAbstractPublication PagessacConference Proceedingsconference-collections
research-article

ABBA: adaptive bicluster-based approach to impute missing values in binary matrices

Published:22 March 2010Publication History

ABSTRACT

Missing values frequently pose problems in binary matrices analysis since they can hinder downstream analysis of the datasets. Despite the presence of many imputation methods that have been developed to substitute missing values with estimated values, these available techniques have some common disadvantages: they need to fix some parameters (e.g., number of patterns, number of rows to consider) to estimate missing values---with little theoretical support to determine these parameters---; and, missing values need to be recomputed from scratch as parameters change.

In this paper we propose a novel algorithm (ABBA: Adaptive Bicluster-Based Approach) that does not have the above limitations. Further, a formal framework that justifies the rationales behind ABBA is detailed. Finally, experimental results over both synthetic and real data confirm the viability of our approach and the quality of the results, that overcomes the ones achieved by the main competing algorithm (KNN).

References

  1. R. Agrawal, T. Imielinski, and A. N. Swami. Mining association rules between sets of items in large databases. In P. Buneman and S. Jajodia, editors, Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pages 207--216, Washington, D.C., 1993. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. E. H. Armin, A. O. Schmitt, J. Lange, S. Meier-ewert, H. Lehrach, and R. Shamir. An algorithm for clustering DNA fingerprints. Genomics, 66:249--256, 2000.Google ScholarGoogle ScholarCross RefCross Ref
  3. C. M. Bishop. Variational principal components. In Proceedings Ninth International Conference on Artificial Neural Networks, ICANN'99, pages 509--514, 1999.Google ScholarGoogle ScholarCross RefCross Ref
  4. A. Colantonio, R. Di Pietro, and A. Ocello. A cost-driven approach to role engineering. In Proceedings of the 23rd ACM Symposium on Applied Computing, SAC '08, volume 3, pages 2129--2136, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. A. Colantonio, R. Di Pietro, and A. Ocello. Leveraging lattices to improve role mining. In Proceedings of the IFIP TC 11 23rd International Information Security Conference, SEC '08, pages 333--347, 2008.Google ScholarGoogle ScholarCross RefCross Ref
  6. A. Colantonio, R. Di Pietro, A. Ocello, and N. V. Verde. A formal framework to elicit roles with business meaning in RBAC systems. In Proceedings of the 14th ACM Symposium on Access Control Models and Technologies, SACMAT '09, pages 85--94, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. A. Colantonio, R. Di Pietro, A. Ocello, and N. V. Verde. Mining stable roles in RBAC. In Proceedings of the IFIP TC 11 24th International Information Security Conference, SEC '09, pages 259--269, 2009.Google ScholarGoogle ScholarCross RefCross Ref
  8. A. Colantonio, R. Di Pietro, A. Ocello, and N. V. Verde. A probabilistic bound on the basic role mining problem and its applications. In Proceedings of the IFIP TC 11 24th International Information Security Conference, SEC '09, pages 376--386, 2009.Google ScholarGoogle ScholarCross RefCross Ref
  9. B. S. Everitt. Cluster Analysis. Edward Arnold and Halsted Press, 1993. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. A. Figueroa, J. Borneman, and T. Jiang. Clustering binary fingerprint vectors with missing values for DNA array data analysis. In CSB '03: Proceedings of the IEEE Computer Society Conference on Bioinformatics, pages 38--47, Washington, DC, USA, 2003. IEEE Computer Society. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2nd edition, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. H. Kim, G. H. Golub, and H. Park. Missing value estimation for DNA microarray gene expression data: local least squares imputation. Bioinformatics, 21(2):187--198, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. S. Kim, E. R. Dougherty, Y. Chen, K. Sivakumar, P. Meltzer, J. M. Trent, and M. Bittner. Multivariate measurement of gene expression relationships. GENOMICS, 67:201--209, 2000.Google ScholarGoogle ScholarCross RefCross Ref
  14. R. J. A. Little and D. B. Rubin. Statistical Analysis with Missing Data. Wiley Series in Probability and Statistics. Wiley, New York, 1st edition, 1987. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. J. Liu, S. Paulsen, W. Wang, A. Nobel, and J. Prins. Mining approximate frequent itemsets from noisy data. In Proceedings of the 5th IEEE International Conference on Data Mining, ICDM '05, pages 721--724, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. H. Lu, J. Vaidya, and V. Atluri. Optimal boolean matrix decomposition: Application to role engineering. In Proceedings of the 24th IEEE International Conference on Data Engineering, ICDE '08, pages 297--306, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. M. A. Mahfouz and M. A. Ismail. BIDENS: Iterative density based biclustering algorithm with application to gene expression analysis. In Proceedings of World Academy of Science, Engineering and Technology, PWASET, volume 37, pages 342--348, 2009.Google ScholarGoogle Scholar
  18. S. Oba, M.-A. Sato, I. Takemasa, M. Monden, K.-I. Matsubara, and S. Ishii. A bayesian missing value estimation method for gene expression profile data. Bioinformatics, 19(16):2088--2096, November 2003.Google ScholarGoogle ScholarCross RefCross Ref
  19. K. Puolamäki, M. Fortelius, and H. Mannila. Seriation in paleontological data using Markov Chain Monte Carlo methods. PLoS Computational Biology, 2(2), February 2006.Google ScholarGoogle Scholar
  20. S. Raychaudhuri, J. M. Stuart, and R. B. Altman. Principal components analysis to summarize microarray experiments: application to sporulation time series. Pac. Symp. Biocomput, pages 452--463, 2000.Google ScholarGoogle Scholar
  21. D. B. Rubin. Inference and missing data. Biometrika, 63(3):581--592, December 1976.Google ScholarGoogle ScholarCross RefCross Ref
  22. D. B. Rubin. Multiple imputation for nonresponse in surveys. Wiley, 1987.Google ScholarGoogle ScholarCross RefCross Ref
  23. J. Schafer. Analysis of Incomplete Multivariate Data. Number 72 in Monographs on Statistics and Applied Probability. Chapman Hall/CRC, 1997.Google ScholarGoogle Scholar
  24. J. Schafer and J. Graham. Missing data: Our view of the state of the art. Psychological Methods, 2002.Google ScholarGoogle ScholarCross RefCross Ref
  25. I. Shmulevich and W. Zhang. Binary analysis and optimization-based normalization of gene expression data. Bioinformatics, 18(4):555--565, 2002.Google ScholarGoogle ScholarCross RefCross Ref
  26. O. G. Troyanskaya, M. Cantor, G. Sherlock, P. O. Brown, T. Hastie, R. Tibshirani, D. Botstein, and R. B. Altman. Missing value estimation methods for DNA microarrays. Bioinformatics, 17(6):520--525, 2001.Google ScholarGoogle ScholarCross RefCross Ref
  27. J. Tuikkala, L. L. Elo, O. S. Nevalainen, and T. Aittokallio. Missing value imputation improves clustering and interpretation of gene expression microarray data. BMC Bioinformatics, 9(202), April 2008.Google ScholarGoogle Scholar
  28. M. J. Zaki and C.-J. Hsiao. Efficient algorithms for mining closed itemsets and their lattice structure. IEEE Transactions on Knowledge and Data Engineering, 17(4):462--478, April 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. ABBA: adaptive bicluster-based approach to impute missing values in binary matrices

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in
      • Published in

        cover image ACM Conferences
        SAC '10: Proceedings of the 2010 ACM Symposium on Applied Computing
        March 2010
        2712 pages
        ISBN:9781605586397
        DOI:10.1145/1774088

        Copyright © 2010 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 22 March 2010

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article

        Acceptance Rates

        SAC '10 Paper Acceptance Rate364of1,353submissions,27%Overall Acceptance Rate1,650of6,669submissions,25%

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader