Abstract
This work investigates the impact of missing values on clustering joint categorical social sequences. Identifying patterns in sociodemographic longitudinal data is important in many social science settings. However, analytical operations such as clustering life course trajectories are challenging because the data are categorical and multidimensional, mix several data types, and are corrupted by missing and inconsistent values. Data quality issues have previously been investigated only for single-variable sequences. To understand their effects on multivariate sequence analysis, we combine a dataset with mixed data types and missing values, a dissimilarity measure designed for joint categorical sequence data, and dimensionality reduction methods in a systematic design of sequence clustering experiments. Given the categorical nature of our data, we employ an “edit” distance computed by optimal matching. Because each data record has multiple variables of different types, we investigate the impact of mixing these variables in a single dissimilarity measure. Comparing variables with binary values against those with multiple nominal values, we find that missing data problems are harder to overcome in the nominal domain than in the binary domain. Additionally, alignment of leading missing values can produce systematic biases in dissimilarity matrices, which in turn introduce both artificial clusters and unrealistic interpretations of the associated data domains. We demonstrate the use of t-distributed stochastic neighbor embedding (t-SNE) to visually guide the mitigation of such biases, either by tuning the missing value substitution cost parameter or by determining an optimal sequence span.
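To make the role of the missing value substitution cost concrete, the following is a minimal sketch of an optimal matching (edit) distance for a single categorical sequence, not the authors' implementation. The function name `om_distance`, the `MISSING` marker, the cost values, and the choice to treat two aligned missing values as a zero-cost match are all assumptions made for illustration.

```python
from typing import Hashable, Optional, Sequence

# Hypothetical marker for a missing state (an assumption, not from the paper).
MISSING = None


def om_distance(
    a: Sequence[Optional[Hashable]],
    b: Sequence[Optional[Hashable]],
    sub_cost: float = 2.0,
    indel_cost: float = 1.0,
    missing_cost: float = 2.0,
) -> float:
    """Optimal matching distance via the standard edit-distance recursion.

    Assumed cost scheme: identical states (including two missing values)
    cost 0, a missing value aligned with an observed state costs
    `missing_cost`, and two different observed states cost `sub_cost`.
    """

    def cost(x: Optional[Hashable], y: Optional[Hashable]) -> float:
        if x == y:
            return 0.0
        if x is MISSING or y is MISSING:
            return missing_cost
        return sub_cost

    n, m = len(a), len(b)
    # prev[j] holds the distance between a[:i-1] and b[:j].
    prev = [j * indel_cost for j in range(m + 1)]
    for i in range(1, n + 1):
        curr = [i * indel_cost] + [0.0] * m
        for j in range(1, m + 1):
            curr[j] = min(
                prev[j] + indel_cost,                      # delete a[i-1]
                curr[j - 1] + indel_cost,                  # insert b[j-1]
                prev[j - 1] + cost(a[i - 1], b[j - 1]),    # substitute or match
            )
        prev = curr
    return prev[m]


# Toy marital-status sequences ("S" single, "M" married), two with leading gaps.
seq_a = [MISSING, MISSING, "S", "M"]
seq_b = [MISSING, MISSING, "M", "M"]
seq_c = ["S", "S", "S", "M"]

for mc in (2.0, 0.5):
    print(mc, om_distance(seq_a, seq_b, missing_cost=mc),
          om_distance(seq_a, seq_c, missing_cost=mc))
```

Under this assumed scheme, the two sequences with leading gaps match on their missing positions at no cost, while the fully observed sequence is penalized at every gap. Lowering `missing_cost` shrinks that penalty, which is the kind of tuning the abstract describes. A matrix of such pairwise distances could then be passed to a t-SNE implementation that accepts precomputed dissimilarities in order to inspect the effect visually.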
Supplemental Material
Supplemental movie and image files for “Evaluating the Effects of Missing Values and Mixed Data Types on Social Sequence Clustering Using t-SNE Visualization”