Evaluating the Effects of Missing Values and Mixed Data Types on Social Sequence Clustering Using t-SNE Visualization

Published: 06 March 2019
Abstract

The goal of this work is to investigate the impact of missing values on clustering joint categorical social sequences. Identifying patterns in sociodemographic longitudinal data is important in a number of social science settings. However, performing analytical operations, such as clustering, on life course trajectories is challenging due to the categorical and multidimensional nature of the data, their mixed data types, and corruption by missing and inconsistent values. Data quality issues were previously investigated on single-variable sequences. To understand their effects on multivariate sequence analysis, we employ a dataset with mixed data types and missing values, a dissimilarity measure designed for joint categorical sequence data, and dimensionality reduction methodologies in a systematic design of sequence clustering experiments. Given the categorical nature of our data, we employ an “edit” distance using optimal matching. Because each data record has multiple variables of different types, we investigate the impact of mixing these variables in a single dissimilarity measure. Comparing variables with binary values and those with multiple nominal values, we find that overcoming missing data problems is more difficult in the nominal domain than in the binary domain. Additionally, alignment of leading missing values can result in systematic biases in dissimilarity matrices and subsequently introduce both artificial clusters and unrealistic interpretations of associated data domains. We demonstrate the use of t-distributed stochastic neighbor embedding (t-SNE) to visually guide mitigation of such biases by tuning the missing value substitution cost parameter or determining an optimal sequence span.
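To make the role of the missing value substitution cost concrete, the following is a minimal sketch of an optimal matching (edit) distance in which a missing state is treated as an extra symbol with its own substitution cost. It is not the authors' implementation: the `MISSING` marker, the function name `om_distance`, the parameter names, and all cost values are assumptions chosen for illustration.

```python
MISSING = "?"  # hypothetical marker for a missing state


def om_distance(a, b, indel=1.0, sub=2.0, missing_sub=2.0):
    """Optimal matching (edit) distance between two state sequences.

    indel       -- cost of inserting or deleting one state
    sub         -- cost of substituting one observed state for another
    missing_sub -- cost of substituting a missing state for an observed one
    """
    def sub_cost(x, y):
        if x == y:
            return 0.0  # identical states, including missing vs. missing
        if x == MISSING or y == MISSING:
            return missing_sub
        return sub

    n, m = len(a), len(b)
    # d[i][j] holds the cheapest cost of editing a[:i] into b[:j].
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * indel
    for j in range(1, m + 1):
        d[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + indel,                             # delete a[i-1]
                d[i][j - 1] + indel,                             # insert b[j-1]
                d[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]),  # substitute
            )
    return d[n][m]


# Two sequences that share leading missing values, and one fully observed.
x = "???MS"
y = "???SM"
z = "MMMMS"

high = (om_distance(x, y), om_distance(x, z))
low = (om_distance(x, y, missing_sub=0.5), om_distance(x, z, missing_sub=0.5))
```

With these illustrative costs, `x` is 2.0 from `y` but 6.0 from `z`, so the shared leading missing values alone pull `x` and `y` together. Lowering `missing_sub` to 0.5 leaves the first distance at 2.0 and reduces the second to 1.5, which shows how this single parameter can change which sequences end up as neighbors.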

        Reviews

        Jonathan P. E. Hodgson

        The authors discuss how one can compensate for missing values when clustering joint categorical sequences. For example, one might have a set of sequences with both nominal values, such as family size, and binary values, such as marriage status. Usually such sequences will exhibit missing values; however, discarding these sequences will result in too few sequences to allow valid conclusions. Some missing values, such as age, can be inferred, but often this is not the case. The authors consider various ways to address missing values. The principal method used is the idea of an edit distance, which measures the number and size of the changes required to edit one sequence into another. In the case where the entries in the sequence consist of multiple values, one can use the average of the edit distances for each category. Given a distance, one can proceed to infer a set of clusters of "like" sequences. Using a study of income dynamics as an exemplar exhibiting both binary and nominal values, the authors provide a detailed description of the process. They give experimental results comparing various choices for distance. The income dynamics sequences provide sufficient variety to allow several measures for distance. By using dimension reduction techniques, the authors are able to provide visual representations of the clusters corresponding to the distance choices. The paper is a useful guide to the available techniques, with sufficient illustrative examples so that readers can apply the ideas.
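The averaging idea mentioned in the review can be sketched as follows. This is only an illustration under assumed unit costs, not the procedure from the paper: the record layout (a dict mapping variable names to sequences), the variable names `married` and `family_size`, and the functions `edit_distance` and `joint_distance` are all hypothetical.

```python
def edit_distance(a, b, indel=1.0, sub=1.0):
    """Edit distance between two sequences, keeping only one row of the table."""
    prev = [j * indel for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [i * indel]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + indel,                          # delete x
                curr[j - 1] + indel,                      # insert y
                prev[j - 1] + (0.0 if x == y else sub),   # match or substitute
            ))
        prev = curr
    return prev[-1]


def joint_distance(rec_a, rec_b):
    """Average the per-variable edit distances of two multi-variable records."""
    if rec_a.keys() != rec_b.keys():
        raise ValueError("records must describe the same variables")
    per_variable = [edit_distance(rec_a[name], rec_b[name]) for name in rec_a]
    return sum(per_variable) / len(per_variable)


# One binary variable and one nominal variable per record (made-up values).
rec_a = {"married": "0011", "family_size": "1123"}
rec_b = {"married": "0111", "family_size": "1234"}

distance = joint_distance(rec_a, rec_b)
```

Here the binary variable contributes a distance of 1.0 and the nominal variable 2.0, so the joint distance is 1.5. A plain average gives every variable equal weight, which is one simple design choice among several possible ways of combining variables of different types.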

        • Published in

          Journal of Data and Information Quality  Volume 11, Issue 2
          On the Horizon, Experience Paper and Regular Papers
          June 2019
          66 pages
          ISSN: 1936-1955
          EISSN: 1936-1963
          DOI: 10.1145/3317030
          Copyright © 2019 ACM

          © 2019 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 6 March 2019
          • Accepted: 1 December 2018
          • Revised: 1 October 2018
          • Received: 1 May 2018

          Qualifiers

          • research-article
          • Research
          • Refereed
