Evaluating the Effects of Missing Values and Mixed Data Types on Social Sequence Clustering Using t-SNE Visualization

Authors:
Alina Lazar (Youngstown State University, Youngstown, OH; ORCID 0000-0002-2096-1541)
Ling Jin (Lawrence Berkeley National Laboratory, Berkeley, CA; ORCID 0000-0002-4381-195X)
C. Anna Spurlock (Lawrence Berkeley National Laboratory, Berkeley, CA)
Kesheng Wu (Lawrence Berkeley National Laboratory, Berkeley, CA)
Alex Sim (Lawrence Berkeley National Laboratory, Berkeley, CA)
Annika Todd (Lawrence Berkeley National Laboratory, Berkeley, CA)

Journal of Data and Information Quality, Volume 11, Issue 2, Article No. 7, pp. 1–22
https://doi.org/10.1145/3301294

Published: 06 March 2019

Abstract

The goal of this work is to investigate the impact of missing values on clustering joint categorical social sequences. Identifying patterns in sociodemographic longitudinal data is important in a number of social science settings. However, analytical operations such as clustering life course trajectories are challenging because the data are categorical and multidimensional, mix several data types, and are corrupted by missing and inconsistent values. Data quality issues have previously been investigated for single-variable sequences. To understand their effects on multivariate sequence analysis, we combine a dataset of mixed data types and missing values, a dissimilarity measure designed for joint categorical sequence data, and dimensionality reduction methods in a systematic design of sequence clustering experiments. Given the categorical nature of our data, we employ an “edit” distance computed by optimal matching. Because each data record has multiple variables of different types, we investigate the impact of mixing these variables in a single dissimilarity measure. Comparing variables with binary values against those with multiple nominal values, we find that missing data problems are harder to overcome in the nominal domain than in the binary domain. Additionally, alignment of leading missing values can introduce systematic biases into dissimilarity matrices, which in turn produce both artificial clusters and unrealistic interpretations of the associated data domains. We demonstrate the use of t-distributed stochastic neighbor embedding (t-SNE) to visually guide the mitigation of such biases by tuning the missing value substitution cost parameter or determining an optimal sequence span.
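
To make the role of the missing value substitution cost concrete, the following is a minimal sketch of an optimal matching (edit) distance in which substitutions involving a missing state have their own tunable cost. It is an illustration under stated assumptions, not the authors' implementation: the `MISSING` marker, the function name `om_distance`, and the default cost values are all hypothetical choices made for this example.

```python
# Hypothetical marker for a missing state; the paper's actual encoding is not given here.
MISSING = "*"


def om_distance(a, b, sub_cost=2.0, indel_cost=1.0, missing_sub_cost=0.0):
    """Optimal matching distance between two categorical state sequences.

    Identical states (including two missing states) cost nothing to align.
    Aligning a missing state with an observed state costs `missing_sub_cost`;
    aligning two different observed states costs `sub_cost`.
    """

    def substitution(x, y):
        if x == y:
            return 0.0
        if x == MISSING or y == MISSING:
            return missing_sub_cost
        return sub_cost

    # Standard dynamic program over prefixes, keeping only the previous row.
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * indel_cost] + [0.0] * len(b)
        for j in range(1, len(b) + 1):
            curr[j] = min(
                prev[j] + indel_cost,                            # delete a[i-1]
                curr[j - 1] + indel_cost,                        # insert b[j-1]
                prev[j - 1] + substitution(a[i - 1], b[j - 1]),  # substitute
            )
        prev = curr
    return prev[len(b)]
```

With these assumed costs, `om_distance("**SSS", "SSSSS", missing_sub_cost=0.0)` returns `0.0`, whereas raising `missing_sub_cost` to `2.0` gives `4.0`. The same pair of sequences therefore moves from identical to clearly separated purely through the missing value cost, which is the parameter the abstract proposes tuning with visual guidance.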

Supplemental Material

Available for download:

a7-lazar-suppl.pdf (PDF, 8.3 MB)
Supplemental movie and image files for "Evaluating the Effects of Missing Values and Mixed Data Types on Social Sequence Clustering Using t-SNE Visualization"

a7-lazar.zip (ZIP, 2 MB)

Index Terms

Evaluating the Effects of Missing Values and Mixed Data Types on Social Sequence Clustering Using t-SNE Visualization
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        Cluster analysis
        Dimensionality reduction and manifold learning
    2. Machine learning algorithms
      1. Feature selection

Reviews

Reviewer: Jonathan P. E. Hodgson

The authors discuss how one can compensate for missing values when clustering joint categorical sequences. For example, one might have a set of sequences with both nominal values, such as family size, and binary values, such as marriage status. Such sequences usually contain missing values; however, discarding the affected sequences leaves too few to allow valid conclusions. Some missing values can be inferred (age, for example), but often this is not the case. The authors consider various ways to address missing values. The principal method used is an edit distance, which measures the number and size of the changes required to edit one sequence into another. Where each entry in the sequence consists of multiple values, one can use the average of the edit distances for each category. Given a distance, one can proceed to infer a set of clusters of "like" sequences. Using a study of income dynamics as an exemplar exhibiting both binary and nominal values, the authors provide a detailed description of the process. They give experimental results comparing various choices of distance. The income dynamics sequences provide sufficient variety to allow several distance measures. By using dimension reduction techniques, the authors are able to provide visual representations of the clusters corresponding to the distance choices. The paper is a useful guide to the available techniques, with sufficient illustrative examples so that readers can apply the ideas.
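
The averaging idea mentioned in the review can be sketched as follows. This is only an illustration: the channel names (`married`, `family_size`), the unit-cost edit distance, and the helper names are assumptions made for this example and are not taken from the paper.

```python
from itertools import combinations


def edit_distance(a, b):
    """Unit-cost edit (Levenshtein) distance between two state sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute if states differ
        prev = curr
    return prev[-1]


def joint_distance(rec_a, rec_b):
    """Average the per-channel edit distances of two multichannel records."""
    channels = list(rec_a)
    return sum(edit_distance(rec_a[c], rec_b[c]) for c in channels) / len(channels)


def distance_matrix(records):
    """Symmetric pairwise dissimilarity matrix, usable as clustering input."""
    n = len(records)
    d = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d[i][j] = d[j][i] = joint_distance(records[i], records[j])
    return d


# Hypothetical records: one binary channel and one nominal channel per person.
records = [
    {"married": "0011", "family_size": "1123"},
    {"married": "0011", "family_size": "1122"},
    {"married": "1111", "family_size": "2233"},
]
matrix = distance_matrix(records)
```

Averaging gives each channel equal weight regardless of how many distinct states it has. A matrix built this way can be passed to any clustering or embedding method that accepts precomputed dissimilarities.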

Published in
Journal of Data and Information Quality Volume 11, Issue 2
On the Horizon, Experience Paper and Regular Papers
June 2019
66 pages
ISSN: 1936-1955
EISSN: 1936-1963
DOI: 10.1145/3317030
Editor: Tiziana Catarci, Sapienza University of Rome, Rome, Italy
Copyright © 2019 ACM
© 2019 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 6 March 2019
- Accepted: 1 December 2018
- Revised: 1 October 2018
- Received: 1 May 2018

Author Tags
Joint sequence analysis
data quality
dimensionality reduction
life trajectories
missing values
optimal matching
t-SNE
time series clustering
Qualifiers
- research-article
- Research
- Refereed