DOI: 10.1145/1526709.1526723
research-article

Latent space domain transfer between high dimensional overlapping distributions

Published: 20 April 2009

Abstract

Transferring knowledge from one domain to another is challenging for several reasons. Because both the conditional and the marginal distributions of the training data and the test data differ, a model trained in one domain usually has low accuracy when applied directly to a different domain. In many applications with large feature sets, such as text documents, sequence data, medical data, and image data of different resolutions, two domains usually do not contain exactly the same features, which introduces large numbers of "missing values" when the data are considered over the union of features from both domains. In other words, their marginal distributions are at most overlapping. At the same time, these problems are usually high dimensional, with, for example, several thousand features. The combination of high dimensionality and missing values therefore makes the relationship between the conditional probabilities of the two domains hard to measure and model. To address these challenges, we propose a framework that first brings the marginal distributions of the two domains closer by "filling up" the missing values of disjoint features. It then looks for comparable sub-structures in the "latent space" mapped from the expanded feature vector, where both the marginal and the conditional distributions are similar. Using these sub-structures in latent space, the proposed approach then finds common concepts that are transferable across domains with high probability. During prediction, unlabeled instances are treated as "queries", the most closely related labeled instances from the out-domain are retrieved, and the classification is made by weighted voting over the retrieved out-domain examples.
We formally show that importing feature values across domains and applying latent semantic indexing jointly make the distributions of two related domains easier to measure than in the original feature space, and that the nearest-neighbor method used to retrieve related out-domain examples has bounded error when predicting in-domain examples. Software and datasets are available for download.
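
The three stages described in the abstract (fill missing values, map to a latent space, retrieve and vote) can be illustrated with a minimal Python sketch. This is not the authors' released software. The abstract does not say how the missing values are filled, so the sketch assumes a least-squares regression from the features that both domains share; it uses a truncated SVD as a stand-in for latent semantic indexing and cosine similarity for retrieval. The function names `fill_missing`, `latent_project`, and `predict`, and the toy data, are hypothetical.

```python
import numpy as np

def fill_missing(target, donor):
    """Fill columns that are entirely NaN in `target`.

    Assumption (not from the paper): each missing feature is predicted by a
    least-squares map from the shared features, fitted on `donor`, where the
    feature is observed. Columns are assumed fully observed or fully missing.
    """
    t_obs = ~np.isnan(target).all(axis=0)
    d_obs = ~np.isnan(donor).all(axis=0)
    shared = t_obs & d_obs          # features present in both domains
    missing = ~t_obs & d_obs        # features only the donor domain has
    filled = target.copy()
    if missing.any():
        W, *_ = np.linalg.lstsq(donor[:, shared], donor[:, missing], rcond=None)
        filled[:, missing] = target[:, shared] @ W
    return filled

def latent_project(X_out, X_in, k):
    """Project both domains onto the top-k right singular vectors of the
    stacked, filled data (an LSI-style truncated SVD)."""
    _, _, Vt = np.linalg.svd(np.vstack([X_out, X_in]), full_matrices=False)
    basis = Vt[:k].T
    return X_out @ basis, X_in @ basis

def predict(Z_out, y_out, Z_in, n_neighbors=3):
    """Treat each in-domain row as a query, retrieve the most similar labeled
    out-domain rows by cosine similarity, and take a similarity-weighted vote."""
    def unit(Z):
        return Z / np.maximum(np.linalg.norm(Z, axis=1, keepdims=True), 1e-12)
    sim = unit(Z_in) @ unit(Z_out).T
    labels = np.unique(y_out)
    preds = []
    for row in sim:
        top = np.argsort(row)[-n_neighbors:]
        weights = np.clip(row[top], 0.0, None)   # ignore negative similarity
        votes = [weights[y_out[top] == c].sum() for c in labels]
        preds.append(labels[int(np.argmax(votes))])
    return np.array(preds)

# Toy data over the union of 4 features: features 0 and 1 are shared,
# feature 2 exists only out-domain, feature 3 only in-domain.
nan = np.nan
X_out = np.array([[3, 0, 1, nan], [2, 0, 1, nan], [0, 3, 0, nan], [0, 2, 0, nan]], float)
y_out = np.array([0, 0, 1, 1])
X_in = np.array([[4, 0, nan, 2], [0, 4, nan, 0], [0, 5, nan, 0], [5, 0, nan, 2.5]], float)

F_out = fill_missing(X_out, X_in)    # import feature 3 into the out-domain
F_in = fill_missing(X_in, X_out)     # import feature 2 into the in-domain
Z_out, Z_in = latent_project(F_out, F_in, k=2)
pred = predict(Z_out, y_out, Z_in)
```

In this toy setting each class is carried by a different shared feature, so the two latent directions separate the classes and the vote over retrieved out-domain rows recovers the in-domain labels. On real data, the filling model, the latent dimension `k`, and the number of neighbors would all need to be chosen with care.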

    Published In

    WWW '09: Proceedings of the 18th international conference on World wide web
    April 2009
    1280 pages
    ISBN:9781605584874
    DOI:10.1145/1526709

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. high dimensional
    2. latent
    3. missing value
    4. text mining
    5. transfer learning

    Qualifiers

    • Research-article

    Conference

    WWW '09

    Acceptance Rates

    Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

    Cited By

    • (2021) Learning Latent Variable Models with Discriminant Regularization. Agents and Artificial Intelligence, pp. 378-398. DOI: 10.1007/978-3-030-71158-0_18. Online publication date: 14-Mar-2021
    • (2018) Knowledge Transfer with Low-Quality Data. IEEE Transactions on Knowledge and Data Engineering 24:10, pp. 1789-1802. DOI: 10.1109/TKDE.2012.75. Online publication date: 31-Dec-2018
    • (2017) Transfer Learning for Cross-Platform Software Crowdsourcing Recommendation. 2017 24th Asia-Pacific Software Engineering Conference (APSEC), pp. 269-278. DOI: 10.1109/APSEC.2017.33. Online publication date: Dec-2017
    • (2017) Cross domain analyzer to acquire review proficiency in big data. ICT Express 3:3, pp. 128-131. DOI: 10.1016/j.icte.2017.04.004. Online publication date: Sep-2017
    • (2017) Crop Disease Image Recognition Based on Transfer Learning. Image and Graphics, pp. 545-554. DOI: 10.1007/978-3-319-71607-7_48. Online publication date: 30-Dec-2017
    • (2015) Transitive Transfer Learning. Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1155-1164. DOI: 10.1145/2783258.2783295. Online publication date: 10-Aug-2015
    • (2014) Multi-transfer. Statistical Analysis and Data Mining 7:4, pp. 282-293. DOI: 10.1002/sam.11226. Online publication date: 1-Aug-2014
    • (2013) Query-dependent cross-domain ranking in heterogeneous network. Knowledge and Information Systems 34:1, pp. 109-145. DOI: 10.1007/s10115-011-0472-7. Online publication date: 1-Jan-2013
    • (2012) Multi-domain active learning for text classification. Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1086-1094. DOI: 10.1145/2339530.2339701. Online publication date: 12-Aug-2012
    • (2012) Mining Distinction and Commonality across Multiple Domains Using Generative Model for Text Classification. IEEE Transactions on Knowledge and Data Engineering 24:11, pp. 2025-2039. DOI: 10.1109/TKDE.2011.143. Online publication date: 1-Nov-2012
