Application of machine learning to associative scRNA-seq data gene expression and alternative polyadenylation sites clustering



Introduction
In recent years, many innovative multimodal methods from image processing and natural language processing have been adopted for the analysis of biomedical data. To study how a gene is expressed in a specific environment and pathway, standard scRNA-seq data are frequently combined with other genomic data for association analysis. In practice, however, constraints on funding, materials, platforms, and technology mean that many standard single-cell datasets lack corresponding multi-omics data such as methylation, proteomics, or metabolomics data.
Recent bioinformatics studies suggest that alternative polyadenylation (APA) [1] expression profiles can be used to identify cell types by capturing key transcriptomic information about APA sites from standard scRNA-seq data and revealing intercellular dynamics between cell types. Using APA expression to probe intercellular dynamics not only allows the discovery of patterns that complement gene expression profiles from standard scRNA-seq data without changing experimental techniques, but also offers great potential for the development of efficient methods to resolve cell types. Thus, APA site expression data can serve as transcriptome isoform information, in place of DNA methylation, open chromatin, or proteome information, and be combined with gene expression data for multi-view analysis. Among existing studies, scLAPA [1], which is based on similarity network fusion (SNF) [2], is the most advanced approach. Of the three methodological strategies proposed in this paper for fusion clustering of single-cell gene expression and APA sites, all except the method based on unsupervised autoencoder fusion embedding show improved results over the existing state-of-the-art method across the five datasets.
* E-mail: zhougq114@126.com
The paper is organized as follows: the methods are presented in Section 2, the results in Section 3, the discussion in Section 4, and the conclusion in Section 5.
Methods
SCAPTURE [3] and scAPAtrap (Tao et al., 2020) [4] were used in this study to identify APA sites and to quantify these sites at the transcriptome level. SCAPTURE identifies peaks using the findPeaks command from HOMER and trains an embedded deep learning network, DeepPASS, on the shifted sequences of the peaks, after which it evaluates the predicted high-confidence peaks and validates them against the collected APA sites. scAPAtrap uses the regionMatrix function in the R package derfinder [5] to identify potential peaks at the whole-genome level. Afterwards, peaks wider than 1000 bp were treated as broad peaks and split, using a quartile of the total read coverage as the cutoff, and this step was repeated until no broad peaks remained.
We collected three gold standard datasets [1] and two silver standard datasets [6] as benchmarks for validating the methods. We used scAPAtrap to identify APA sites in the three gold standard datasets and SCAPTURE to identify APA sites in the two silver standard datasets.
We quantified the APA sites at the transcriptome level for each of the five datasets to obtain the PA matrix, and the initial gene expression data were quantified as the GE matrix. We used the FindVariableFeatures function from Seurat [7] to extract the top 2000 highly variable genes in the GE matrix and the top 2000 highly variable APA sites in the PA matrix.
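The feature selection step can be illustrated with a minimal sketch. It is a simplified stand-in that ranks features by plain variance across cells, not Seurat's own procedure; the function name `top_variable_features`, the gene names, and the toy values are hypothetical.

```python
from statistics import pvariance


def top_variable_features(matrix, feature_names, n_top):
    """Names of the n_top features with the largest variance across cells.

    matrix is a list of rows, one per cell, with one column per feature.
    """
    variances = [pvariance(column) for column in zip(*matrix)]
    ranked = sorted(range(len(feature_names)),
                    key=lambda j: variances[j], reverse=True)
    return [feature_names[j] for j in ranked[:n_top]]


# Toy GE matrix: 4 cells x 3 genes (hypothetical names and values).
ge_matrix = [
    [5.0, 1.0, 0.0],
    [5.0, 3.0, 0.0],
    [5.0, 1.0, 9.0],
    [5.0, 3.0, 0.0],
]
selected = top_variable_features(ge_matrix, ["geneA", "geneB", "geneC"], n_top=2)
```

In this toy matrix geneA is constant across cells, so it is the feature that gets dropped. The same function would be applied separately to the GE matrix and the PA matrix with `n_top=2000`.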
The HY-dataset is a scRNA-seq dataset of mouse hypothalamus comprising 727 single cells in 7 categories. The EPI-dataset is a scRNA-seq dataset of mouse mammary epithelial cells comprising 2127 single cells in 5 categories. The TAIR-dataset is a scRNA-seq dataset of Arabidopsis roots comprising 1473 single cells in 7 categories. PBMC-4K is a scRNA-seq dataset of 4K human peripheral blood cells comprising 4292 single cells in 11 categories. PBMC-8K is a scRNA-seq dataset of 8K human peripheral blood cells comprising 8352 single cells in 11 categories. The specific cell types of the five datasets are shown in Table 1.

Algorithm scheme
We validated the effectiveness of multiple machine learning and deep learning approaches for optimizing the clustering of single-cell gene expression data associated with APA sites. Recent studies have found that supervised learning has the advantage of fast training but has some limitations in accuracy, whereas unsupervised learning can substantially compensate for these shortcomings. Therefore, we designed three schemes to build the fusion clustering part of the workflow. The Adjusted Rand Index (ARI) [18] is used to evaluate the performance of the proposed algorithms and the baselines. To reduce the effect of randomness, we ran every algorithm 5 times and report the average values.
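The evaluation protocol can be sketched as follows. This is a minimal standard-library implementation of the usual ARI formula (pair counts from the contingency table, corrected for chance), averaged over repeated runs; the function names `adjusted_rand_index` and `mean_ari` and the toy labels are illustrative assumptions, not the code used in this study.

```python
from collections import Counter
from math import comb
from statistics import mean


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same cells."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs of cells to disagree on

    # Contingency counts: cells shared by each (true class, cluster) pair.
    pair_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_true * sum_pred / comb(n, 2)   # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:  # degenerate case, e.g. one cluster in both
        return 1.0
    return (sum_pairs - expected) / (max_index - expected)


def mean_ari(labels_true, runs):
    """Average ARI over the cluster assignments of several runs."""
    return mean(adjusted_rand_index(labels_true, run) for run in runs)


# Toy example: 4 cells, 2 true cell types, 2 repeated clustering runs.
truth = [0, 0, 1, 1]
score = mean_ari(truth, [[1, 1, 0, 0], [0, 0, 1, 1]])
```

ARI ignores the names of the clusters, so a run that swaps the two labels still scores as a perfect match. In the protocol described above, `runs` would hold the 5 cluster assignments produced by one algorithm on one dataset.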

Based on unsupervised spectral clustering algorithms
This scheme assumes that all views share a consensus ground-truth Laplacian matrix. This consensus Laplacian matrix is typically unknown, but it can be approximated by a weighted combination of the Laplacian matrices of the individual views [8].
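The weighted combination can be sketched as follows. This is only an illustration under stated assumptions: the Gaussian kernel, the unnormalized Laplacian L = D - W, and the fixed equal weights are choices made for the example (the method in [8] may normalize differently and determine the weights itself), and all function names and toy values are hypothetical.

```python
import math


def gaussian_affinity(points, sigma=1.0):
    """Symmetric affinity matrix with a Gaussian kernel and zero diagonal."""
    n = len(points)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            w[i][j] = w[j][i] = math.exp(-d2 / (2 * sigma ** 2))
    return w


def laplacian(w):
    """Unnormalized graph Laplacian L = D - W."""
    n = len(w)
    return [[(sum(w[i]) if i == j else 0.0) - w[i][j] for j in range(n)]
            for i in range(n)]


def combine_laplacians(laplacians, weights):
    """Weighted sum of per-view Laplacians, weights rescaled to sum to 1."""
    total = sum(weights)
    n = len(laplacians[0])
    return [[sum(wt / total * lap[i][j] for wt, lap in zip(weights, laplacians))
             for j in range(n)] for i in range(n)]


# Toy data: 4 cells described by a gene expression view and an APA view.
ge_view = [[0.0, 0.1], [0.1, 0.0], [2.0, 2.1], [2.1, 2.0]]
pa_view = [[1.0], [1.1], [3.0], [3.2]]
view_laplacians = [laplacian(gaussian_affinity(v)) for v in (ge_view, pa_view)]
consensus = combine_laplacians(view_laplacians, weights=[0.5, 0.5])
```

Spectral clustering would then take the eigenvectors of `consensus` with the smallest eigenvalues and cluster the cells in that embedding; that step is omitted here.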

Based on unsupervised autoencoder fusion embedding
An autoencoder is a deep neural network. It consists mainly of an encoder and a decoder, both of which are multilayer neural networks and can be represented by Equation (2) and Equation (3). Autoencoder-based models have two types of structure: early fusion and late fusion.
In early fusion, the vectors of all views are concatenated into a single feature vector $X$, so the encoder and the decoder can be represented as $Z = \mathrm{Encoder}(X)$ and $\hat{X} = \mathrm{Decoder}(Z)$. In late fusion, a separate autoencoder performs feature extraction on the vector of each view. The encoder and decoder can be expressed as in Equation (4) and Equation (5), respectively.
Finally, the latent features $Z$ of the individual views were concatenated into the multi-view fusion features. In this scheme, we therefore validated 10 methods based on unsupervised autoencoder fusion embedding. These methods were built by Leng et al. (2022) [15] for validating the fusion clustering of biological multimodal data. The methods, which combine five autoencoders with the two fusion strategies, are shown in Table 2.
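The structural difference between the two fusion strategies can be sketched as follows. The encoders here are single tanh layers with random, untrained weights that merely stand in for trained autoencoder encoders; the function names, dimensions, and toy vectors are assumptions made for illustration and do not reproduce the models of [15].

```python
import math
import random


def make_encoder(in_dim, latent_dim, rng):
    """Return a one-layer tanh encoder with random (untrained) weights."""
    weights = [[rng.gauss(0, 1 / math.sqrt(in_dim)) for _ in range(in_dim)]
               for _ in range(latent_dim)]

    def encode(x):
        return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in weights]

    return encode


def early_fusion(views, latent_dim, rng):
    """Concatenate the view vectors first, then encode the joint vector."""
    joint = [value for view in views for value in view]
    return make_encoder(len(joint), latent_dim, rng)(joint)


def late_fusion(views, latent_dim, rng):
    """Encode each view with its own encoder, then concatenate the latents."""
    fused = []
    for view in views:
        fused.extend(make_encoder(len(view), latent_dim, rng)(view))
    return fused


rng = random.Random(0)
ge_cell = [0.2, 1.5, 0.0, 3.1]   # toy gene expression vector for one cell
pa_cell = [0.7, 0.0, 2.2]        # toy APA site vector for the same cell
z_early = early_fusion([ge_cell, pa_cell], latent_dim=2, rng=rng)
z_late = late_fusion([ge_cell, pa_cell], latent_dim=2, rng=rng)
```

With a latent size of 2 per encoder, early fusion yields a 2-dimensional embedding of the joint vector, while late fusion yields a 4-dimensional embedding (2 dimensions per view). In either case the fused embedding is what would be passed to the clustering step.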

Based on supervised deep learning model
The third scheme is inspired by MOGONET [16]. It combines graph convolutional networks (GCN) for omics-specific learning with the view correlation discovery network (VCDN) [17] for multi-omics integration. It is divided into two main parts: (1) a GCN makes an initial class prediction from each omics dataset; (2) the initial predictions are used to construct a cross-omics discovery tensor, which is sent to the VCDN for training. Its application to the association clustering of single-cell gene expression and APA sites is shown in Figure 1. In addition, we combine a fully connected neural network (NN) for omics-specific learning with the VCDN for multi-omics integration and compare it against MOGONET (GCN-VCDN).
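The construction of the cross-omics discovery tensor for two views can be sketched as follows. The sketch assumes that, for each cell, the tensor is the outer product of the class-probability vectors predicted from the two views, flattened into one input vector for the integration network; the function name `cross_view_tensor` and the probability values are hypothetical.

```python
def cross_view_tensor(prob_ge, prob_pa):
    """Outer product of two class-probability vectors, flattened row by row."""
    return [p * q for p in prob_ge for q in prob_pa]


# Hypothetical initial predictions for one cell over 3 cell types.
prob_ge = [0.7, 0.2, 0.1]   # from the gene expression network
prob_pa = [0.6, 0.3, 0.1]   # from the APA site network
tensor = cross_view_tensor(prob_ge, prob_pa)
```

Entry `i * 3 + j` of `tensor` is the product of the probability of class `i` under the gene expression view and class `j` under the APA view, so the entries where both views agree lie on the diagonal of the underlying 3 x 3 matrix. With two views and c classes, the integration network receives a vector of length c squared per cell.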

Results
When we tested the GCN-VCDN model [16], we found that one of its hyper-parameters, k, has a significant impact on the experimental results. In this model, k represents the average number of edges per sample in the similarity network. If k is too large, the similarity network becomes too dense, which introduces noise. Conversely, if k is too small, correlations between samples in the similarity network may be lost. We therefore tested a range of values of k on each dataset and took the best score as the GCN-VCDN score in the subsequent analysis, as shown in Figure 2.
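The role of k can be illustrated with a small sketch that builds a similarity network whose average number of edges per sample equals k. It is a simplified stand-in, assuming cosine similarity and keeping the most similar sample pairs until the target density is reached; it is not the GCN-VCDN code, and the function names and toy data are hypothetical.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_graph(samples, k):
    """Adjacency matrix keeping the most similar pairs, about k edges per sample."""
    n = len(samples)
    pairs = sorted(
        ((cosine(samples[i], samples[j]), i, j)
         for i in range(n) for j in range(i + 1, n)),
        reverse=True,
    )
    n_edges = min(len(pairs), round(k * n / 2))  # each edge serves two samples
    adj = [[0] * n for _ in range(n)]
    for _, i, j in pairs[:n_edges]:
        adj[i][j] = adj[j][i] = 1
    return adj


def average_degree(adj):
    """Average number of edges per sample."""
    return sum(map(sum, adj)) / len(adj)


rng = random.Random(0)
cells = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(6)]
sparse = similarity_graph(cells, k=2)   # few edges: weak links are dropped
dense = similarity_graph(cells, k=4)    # many edges: noisier neighbours enter
```

Every edge in `sparse` is also present in `dense`, since both keep the top of the same ranking. Raising k only ever adds weaker similarities, which is why a large k tends to add noise while a small k can discard informative links.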
In addition to the methods from the three designed schemes, we compared three clustering methods that use only the single-cell gene expression matrix: SINCERA [19], SNN-Cliq [20], and dynamicTreeCut [21]. The comparison of the clustering results of the different algorithms on the five datasets is shown in Table 3.
Table 3 shows that the algorithms based on supervised deep learning models achieve the best ARI. The average ARI of the method based on the supervised deep learning model improved by 70.93% compared with the best gene-expression-only clustering method and by 37.1% compared with scLAPA. The methods based on unsupervised spectral clustering also achieved better results: their average ARI improved by 32.58% compared with the best gene-expression-only clustering method and by 6.3% compared with scLAPA. The method based on unsupervised autoencoder fusion embedding improved the average ARI by 11.16% compared with the best gene-expression-only clustering method, but its average ARI was 10.87% lower than that of scLAPA.

Discussion
The results again demonstrate that APA site information is valid as an additional modality for scRNA-seq data. The approach based on the supervised deep learning model shows a significant improvement over clustering with gene expression data alone. This shows that first training a neural network on each individual data view, and then training iteratively on a constructed cross-view tensor, is effective, provided that labels are available. Meanwhile, the scheme without any label annotations also shows a significant improvement over clustering with gene expression profiles alone. However, the method based on unsupervised autoencoder fusion embedding did not perform well, even though it exceeded the methods that cluster gene expression profiles alone. Without any labels for training, the method based on spectral clustering may, through optimization, perform better than deep learning methods that rely on training labels. In practical studies, little data has been benchmarked, corrected, and labeled, so advanced unsupervised or semi-supervised methods are more widely used. As deep learning develops, models that can be trained without manual labels become more valuable. Although its performance is moderate, the autoencoder-based fusion embedding method meets the requirements of the problem. It also suggests a research direction for method developers.

Conclusion
This paper examines the clustering performance of 18 methods from three schemes on five datasets, in a study of multimodal association analysis that uses only standard scRNA-seq data. All three schemes improve on clustering with gene expression data alone, and two of them also improve on the existing state-of-the-art method. These three schemes offer other researchers a wider range of ideas for multimodal data analysis in biomedicine.