Statistical detection of genes associated with PIK3C2B in lung disease

Statistical gene detection plays an important role in biostatistics and bioinformatics. Many gene loci associated with complex human diseases have been found by statistical methods. However, it is difficult to find all the mutated genes that are associated with a given disease, so researchers need to keep detecting further associated genes if the disease is eventually to be conquered. In this paper, we consider a large real data set and study the problem of detecting genes associated with the PIK3C2B gene in lung disease. Using a statistical multiple testing method, we detected 168 genes significantly associated with the PIK3C2B gene at the nominal significance level 0.001. The detected genes provide a reference for biologists and medical scientists in further studying the role of the PIK3C2B gene in lung disease.


Introduction
The life sciences have developed rapidly in recent years, and statistical methods have played an important role in this progress [1,2]. Many people in the world suffer from complex diseases (e.g., diabetes, cancers, and Alzheimer's disease), and these diseases are controlled by certain mutated genes. Encouragingly, many gene loci associated with important phenotypes or diseases have been detected by researchers via statistical methods, and these loci have been further validated by medical scientists [3,4].
From the viewpoint of biostatistics, researchers prefer to use linkage or association analysis to conduct gene mapping. From the perspective of bioinformatics, analyzing gene expression data is an effective way to find the latent genes or proteins that are responsible for a certain disease [5]. Many researchers are engaged in the analysis of gene expression data for cancers and have made significant progress in combating this kind of stubborn disease [6,7].
Lung cancer is a common complex disease, and its incidence is higher than that of many other diseases. Many people die of this disease every year. In recent years, more and more researchers have studied the genetic mechanism of lung cancer [8-11]. The PIK3C2B gene is considered to be an important gene affecting lung diseases, especially lung cancer [12], and it has also been reported to be related to some other complex diseases [13,14]. The Cancer Cell Line Encyclopedia (CCLE) project was launched by the Broad Institute together with the Novartis Institutes for Biomedical Research and its Genomics Institute of the Novartis Research Foundation. The CCLE provides public access to genomic data, analysis and visualization for more than 1100 cell lines (https://portals.broadinstitute.org/ccle/).
In this paper, we downloaded a gene expression data set from the CCLE and analyzed the correlation between the PIK3C2B gene and other genes in lung cancer. We measured the degree of linear correlation between random variables using the Pearson correlation coefficient, which allows us to identify genes that are positively correlated, negatively correlated, or uncorrelated with the PIK3C2B gene. In addition, we modeled the linear relationship between the PIK3C2B gene and each of the other genes using a linear regression model and performed hypothesis testing. The results can support further investigation of the pathogenicity of genes, provide valuable references on the relationship between genes and lung cancer traits, and provide theoretical support for further medical research.

Material and methods
The gene expression data that we downloaded from the CCLE is a typical large-scale data set. The original data set was studied with respect to lung cancer traits and genotypes and includes 56202 genes and their mRNA expression values on 1019 cell lines, but it does not classify the lung cancer types. Among these 1019 cell lines, 188 are derived from lung tissues of patients with lung cancer.
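The subsetting step described above, from all cell lines down to the lung-derived ones, can be sketched as follows. This is a minimal illustration and not the authors' procedure: it assumes that each column label carries its tissue of origin as a suffix such as `_LUNG`, and the function name `select_tissue_columns`, the toy matrix, and the cell line labels are all hypothetical stand-ins for the real 56202 by 1019 data.

```python
import numpy as np

def select_tissue_columns(expr, cell_lines, tissue="LUNG"):
    """Keep only the columns whose cell line label ends with _<tissue>.

    expr       : 2-D array, genes in rows and cell lines in columns
    cell_lines : list of column labels, one per column of expr
    The suffix convention is an assumption of this sketch.
    """
    suffix = "_" + tissue.upper()
    keep = [j for j, name in enumerate(cell_lines)
            if name.upper().endswith(suffix)]
    return expr[:, keep], [cell_lines[j] for j in keep]

# Toy stand-in for the real matrix: 3 genes x 4 cell lines.
cell_lines = ["A549_LUNG", "MCF7_BREAST", "NCIH1299_LUNG", "HT29_LARGE_INTESTINE"]
expr = np.arange(12, dtype=float).reshape(3, 4)

lung_expr, lung_names = select_tissue_columns(expr, cell_lines)
print(lung_expr.shape, lung_names)
```

Applied to the real data under the same labelling assumption, the returned matrix would have one column per lung cell line.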
The expression values of the PIK3C2B gene are located in the 4322nd line of the data set. The histogram of the expression values of the PIK3C2B gene on the 188 lung tissues is presented in Figure 1, where the horizontal axis indicates the PIK3C2B gene expression value in the 188 lung tissues and the vertical axis indicates the probability that the expression value falls within the corresponding range. This histogram differs somewhat from the histogram of the expression values of the PIK3C2B gene on all 1019 cell lines (see Figure 2, where the horizontal axis indicates the PIK3C2B gene expression value in the 1019 cell lines and the vertical axis indicates the probability that the expression value falls within the corresponding range); i.e., the distribution of the former is slightly right-biased. Our purpose is to detect the genes whose expression is associated with that of the PIK3C2B gene in lung tissue.

Pearson correlation coefficient
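The Pearson correlation coefficient is widely used to measure the relationship between two random variables [15]. If random variable $X$ has observed values $X_1, X_2, \ldots, X_n$ and random variable $Y$ has observed values $Y_1, Y_2, \ldots, Y_n$, then the Pearson correlation coefficient is given by

$$ r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}}, $$

where $\bar{X}$ and $\bar{Y}$ are the sample means of the observations and $|r| \le 1$. The Pearson correlation coefficient describes well the degree of linear relationship between the random variables $X$ and $Y$.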
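As an illustration, the Pearson correlation coefficient can be computed directly from its usual definition in terms of deviations from the sample means. The sketch below is not taken from the paper: the function name `pearson_r` is hypothetical, and returning NaN when one variable is constant is a choice made here, since the coefficient is undefined in that case.

```python
import numpy as np

def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length samples."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim != 1 or x.shape != y.shape:
        raise ValueError("x and y must be 1-D arrays of the same length")
    dx = x - x.mean()          # deviations from the sample mean of X
    dy = y - y.mean()          # deviations from the sample mean of Y
    denom = np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))
    if denom == 0.0:
        return float("nan")    # undefined when either sample is constant
    return float(np.sum(dx * dy) / denom)

# A perfectly linear increasing relationship gives r = 1,
# and a perfectly linear decreasing one gives r = -1.
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))
```

The result agrees with the off-diagonal entry of `numpy.corrcoef(x, y)`, which can be used instead when no special handling of constant samples is needed.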

Analysis results of real data
From the above-mentioned data set, we selected the expression values of all 56202 genes on lung tissue and obtained a 56202 by 188 data matrix. First, we calculated the 56201 Pearson correlation coefficients between the PIK3C2B gene and each of the other genes (see the histogram of the Pearson correlation coefficients in Figure 3 and the 30 largest correlation coefficients in Table 1). From this result, we find that most of the correlation coefficients are positive and that the largest correlation coefficient exceeds 0.6, although a large part of the values are close to zero. That is to say, there are indeed genes that are positively associated with the PIK3C2B gene in lung cancer. Of the 56201 Pearson correlation coefficients, 53586 are non-zero.
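Computing the correlation of one gene with every other gene can be done in a single vectorized pass over a genes by cell lines matrix. The following is a minimal sketch on synthetic data, not the authors' code: the function names `correlate_with_target` and `top_k_genes` are hypothetical, and rows with constant expression are reported as NaN here because their correlation is undefined (how the original analysis handled such rows is not stated).

```python
import numpy as np

def correlate_with_target(expr, target_row):
    """Pearson correlation of every row of expr with row target_row."""
    expr = np.asarray(expr, dtype=float)
    centered = expr - expr.mean(axis=1, keepdims=True)
    norms = np.sqrt(np.sum(centered ** 2, axis=1))
    # Constant rows have zero norm; let 0/0 become NaN without warnings.
    with np.errstate(divide="ignore", invalid="ignore"):
        r = (centered @ centered[target_row]) / (norms * norms[target_row])
    return r

def top_k_genes(r, target_row, k=30):
    """Row indices of the k largest coefficients, excluding the target and NaNs."""
    r = np.array(r, dtype=float)       # work on a copy
    r[target_row] = np.nan             # the target trivially has r = 1
    order = np.argsort(-r)             # descending; NaN entries sort last
    order = order[~np.isnan(r[order])]
    return order[:k]

# Synthetic stand-in: 6 genes x 188 cell lines, gene 0 plays the target role.
rng = np.random.default_rng(0)
expr = rng.normal(size=(6, 188))
expr[3] = 2.0 * expr[0] + 1.0          # gene 3 is an exact linear function of gene 0
expr[5] = 7.0                          # gene 5 has constant expression

r = correlate_with_target(expr, target_row=0)
top = top_k_genes(r, target_row=0, k=3)
print(r, top)
```

On the real matrix, the target row would be the one holding the PIK3C2B expression values and `k=30` would give the candidates for a table of the largest coefficients.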
Second, we further tested the linear relationship between the expression value of the PIK3C2B gene and the expression values of the other genes. Taking the expression value of each gene as the response variable and the expression value of the PIK3C2B gene as the independent variable, we constructed 53586 linear regression models. To find the significant relationships among them, we conducted 53586 hypothesis tests with a Bonferroni adjustment. The overall significance level is taken to be 0.001, so the level for each individual test after the Bonferroni adjustment is 0.001/53586, or about 1.87 × 10^-8. The P values of the t tests for all genes were calculated, and we found 168 significant results among the 53586 tests. The 30 most significant P values are listed in Table 2. These results are significant in the statistical sense only, so some of them may be false positives, but we hope that the inference results can provide a valuable reference for biologists and medical scientists, who can further study the pathogenicity of the detected genes and their relationship to lung cancer.
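The testing step can be sketched as a simple linear regression of each gene on the PIK3C2B expression, a two-sided t test on the slope with n - 2 degrees of freedom, and a Bonferroni threshold equal to the overall level divided by the number of tests. The code below is an illustrative sketch on synthetic data and assumes NumPy and SciPy are available; the function names `slope_t_test` and `bonferroni_significant` are hypothetical and the exact test used in the paper may differ in detail.

```python
import numpy as np
from scipy import stats

def slope_t_test(x, y):
    """Fit y = a + b*x by least squares and test H0: b = 0 (two-sided)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    dx = x - x.mean()
    sxx = np.sum(dx ** 2)
    slope = np.sum(dx * (y - y.mean())) / sxx
    intercept = y.mean() - slope * x.mean()
    resid = y - (intercept + slope * x)
    s2 = np.sum(resid ** 2) / (n - 2)          # residual variance estimate
    t_stat = slope / np.sqrt(s2 / sxx)         # slope divided by its standard error
    p_value = 2.0 * stats.t.sf(abs(t_stat), df=n - 2)
    return slope, t_stat, p_value

def bonferroni_significant(p_values, alpha=0.001):
    """Flag tests whose P value is below alpha divided by the number of tests."""
    p = np.asarray(p_values, dtype=float)
    return p < alpha / p.size

# Per-test level for the numbers quoted in the text.
print(0.001 / 53586)

# Synthetic stand-in: x plays the role of the PIK3C2B expression values.
rng = np.random.default_rng(1)
x = rng.normal(size=188)
y_linked = 2.0 * x + rng.normal(scale=0.5, size=188)   # truly related gene
y_null = rng.normal(size=188)                          # unrelated gene
p_values = [slope_t_test(x, y)[2] for y in (y_linked, y_null)]
print(bonferroni_significant(p_values, alpha=0.001))
```

For a single predictor, the slope t statistic equals r·sqrt(n - 2)/sqrt(1 - r²), so this test is equivalent to testing whether the Pearson correlation is zero.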

Conclusion and discussion
In this paper, we performed extensive statistical inference on the relationships among a large number of gene expression variables based on a real data set. Using Pearson correlation coefficients and multiple hypothesis testing, we found 168 significant results indicating genes related to the PIK3C2B gene in lung cancer. If these results are further investigated by biologists and medical scientists, valuable evidence for treating lung cancer may be found. Of course, other correlation coefficients could be used in our analysis, and more complex models could be built to analyze this data set, so that complementary results would be obtained. In addition, the effect of gene-gene (G×G) interactions on lung cancer traits can also be studied. We leave these topics to future work.