ABSTRACT
Missing values are a frequent problem in the analysis of binary matrices, since they can hinder downstream analysis of the datasets. Many imputation methods have been developed to replace missing values with estimates, but the available techniques share two disadvantages: they require some parameters to be fixed in advance (e.g., the number of patterns or the number of rows to consider), with little theoretical support for choosing them; and the missing values must be recomputed from scratch whenever those parameters change.
In this paper we propose a novel algorithm, ABBA (Adaptive Bicluster-Based Approach), that does not have these limitations. We also detail a formal framework that justifies the rationale behind ABBA. Finally, experimental results on both synthetic and real data confirm the viability of our approach and the quality of its results, which surpass those achieved by the main competing algorithm (KNN).
Index Terms
- ABBA: adaptive bicluster-based approach to impute missing values in binary matrices