ABSTRACT
Finding disease-related genes is important in drug discovery. Many genes can be involved in a disease, and many studies have been conducted and reported for each disease. However, checking these reports one by one is very costly, which makes machine learning a suitable approach to the problem. Extracting study results from research papers by text mining makes it possible to use that knowledge. In this research, we aim to extract disease-related genes from PubMed papers using word2vec, a text mining method. The method extracts the top 10 genes whose word2vec vectors are closest to those of known disease genes. From these, the genes that are not already known to be disease-related are kept as newly extracted disease-related genes. We conducted experiments on schizophrenia and assessed the plausibility of the extracted genes using XGBoost with three gene sets. Pattern 1: only known genes. Pattern 2: Pattern 1 plus the disease-related genes extracted in this study. Pattern 3: Pattern 1 plus the same number of random genes. Using these three patterns, we trained XGBoost classifiers on microarray data and compared their classification accuracy. Pattern 2 achieved the highest accuracy. This suggests that our method can extract disease-related genes starting from known disease genes.
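The extraction step and the three gene sets described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the embedding vectors, the placeholder gene names (GENE_A and so on), and the helper names `candidate_genes` and `build_patterns` are all hypothetical, and the sketch assumes that neighbours are ranked by cosine similarity and restricted to gene symbols. In practice the vectors would come from a word2vec model trained on PubMed text, and each pattern would select the microarray features passed to a classifier such as XGBoost.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def candidate_genes(embeddings, known_genes, gene_symbols, top_k=10):
    """For each known gene, take its top_k nearest gene neighbours and
    keep those that are not already known (assumed reading of the method)."""
    known = set(known_genes)
    candidates = set()
    for seed in known:
        if seed not in embeddings:
            continue  # seed gene never appeared in the corpus
        scored = sorted(
            (
                (cosine(embeddings[seed], vec), word)
                for word, vec in embeddings.items()
                if word != seed and word in gene_symbols
            ),
            reverse=True,
        )
        candidates.update(w for _, w in scored[:top_k] if w not in known)
    return candidates


def build_patterns(known_genes, candidates, all_genes, seed=0):
    """Build the three gene sets compared in the abstract."""
    known = set(known_genes)
    rng = random.Random(seed)
    # Random genes are drawn from genes that are neither known nor extracted.
    pool = sorted(set(all_genes) - known - set(candidates))
    random_genes = set(rng.sample(pool, min(len(candidates), len(pool))))
    return {
        "pattern1": known,
        "pattern2": known | set(candidates),
        "pattern3": known | random_genes,
    }


# Hypothetical toy embeddings; real vectors would come from word2vec.
embeddings = {
    "GENE_A": [1.0, 0.0],
    "GENE_B": [0.9, 0.1],
    "GENE_C": [0.0, 1.0],
    "GENE_D": [0.8, 0.3],
    "GENE_E": [-1.0, 0.0],
    "patient": [0.95, 0.05],  # non-gene word, ignored as a neighbour
}
gene_symbols = {"GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_E"}
known = ["GENE_A", "GENE_B"]

extracted = candidate_genes(embeddings, known, gene_symbols, top_k=2)
patterns = build_patterns(known, extracted, gene_symbols)
print(sorted(extracted))
print({name: sorted(genes) for name, genes in patterns.items()})
```

Drawing the random genes for Pattern 3 from outside both the known and the extracted sets keeps Patterns 2 and 3 the same size, so a difference in accuracy is not explained by the number of features alone.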
Index Terms
- Extraction of disease-related genes from PubMed paper using word2vec