Construction of a risk prediction model for type 2 diabetes mellitus based on logistic regression

Objective: To construct a multi-factor prediction model for the individual risk of T2DM, and to explore new approaches to early warning, prevention, and personalized health services for T2DM. Methods: Logistic regression was used to screen the risk factors for T2DM and to construct the risk prediction model. Results: The logistic regression equation of the risk prediction model for men was logit(P) = BMI × 0.735 + vegetables × (-0.671) + age × 0.838 + diastolic pressure × 0.296 + physical activity × (-2.287) + sleep × (-0.009) + smoking × 0.214; for women it was logit(P) = BMI × 1.979 + vegetables × (-0.292) + age × 1.355 + diastolic pressure × 0.522 + physical activity × (-2.287) + sleep × (-0.010). For men, the area under the ROC curve was 0.83, the sensitivity 0.72, and the specificity 0.86; for women, the area under the ROC curve was 0.84, the sensitivity 0.75, and the specificity 0.90. Conclusion: The model data come from a nested case-control study, the risk prediction model was built with well-established logistic regression techniques, and the model shows high predictive sensitivity, specificity, and stability.


Introduction
With rapid socioeconomic development, lifestyles and dietary patterns have changed and population aging has intensified. The prevalence of diabetes has risen rapidly, especially in China [1], placing a heavy economic burden on society and families. In recent years, other countries have increasingly used risk assessment tools to predict and score the risk of type 2 diabetes mellitus (T2DM), so that high-risk groups can be identified early, risk factors can be controlled through health education and lifestyle intervention, and T2DM can thereby be reduced [2]. Some domestic scholars have also constructed T2DM risk models, using methods such as risk factor scoring and OR-based scoring, to predict who is at high risk of T2DM; they have also carried out publicity and education to encourage acceptance of the individual T2DM risk prediction model [3][4] in the identification of high-risk

Corresponding author: Shumei Li, gnyxylsm@163.com. Phone: 008615083787928.
# These authors contributed equally to this study and share first authorship.
This study was supported by the National Natural Science Foundation of China (No. 81360445) and the Science and Technology Support Program of Jiangxi Province (No. 20132BBG70086).

Research methods
Researchers randomly selected 2/3 of the samples (5391 persons) from the MSSQL database as the training group. The pre-selected variables were age, sex, BMI, waist circumference, waist-to-hip ratio (WHR), blood pressure (BP), T2DM family history, physical activity, smoking and drinking, FBG, TC, TG, HDL-C, and LDL-C; single-factor analysis was then performed on these pre-selected variables. Multivariate analysis used the well-established logistic regression model: diabetes status was the dependent variable (1 for diabetic patients, 0 for non-diabetic patients), and the other variables (suspected risk factors) entered a multivariate logistic regression as independent variables, with logit(P) = ln(P / (1 − P)), to construct the T2DM risk prediction model. To increase prediction accuracy, separate prediction models were fitted for each gender. The remaining 1/3 of the study cohort (2695 people) was used as the test group; the receiver operating characteristic (ROC) curve and the area under it were used to analyze the sensitivity and specificity of the model and to evaluate its predictive performance.
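The split and the logit transform described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record ids, the random seed, and the function names are hypothetical, and the paper does not say how the random selection from the MSSQL database was actually carried out.

```python
import math
import random

def logit(p: float) -> float:
    """Log-odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inverse_logit(z: float) -> float:
    """Map a log-odds value back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def split_train_test(ids, train_fraction=2 / 3, seed=0):
    """Randomly split record ids into a training group and a test group."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)  # seed is an arbitrary assumption
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]

record_ids = range(8086)  # hypothetical ids, one per record in the cohort
train_ids, test_ids = split_train_test(record_ids)
print(len(train_ids), len(test_ids))  # 5391 2695
```

Rounding 2/3 of 8086 gives the 5391 training records reported in the text, which leaves 2695 records for the test group.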
3.4 Relationship between blood pressure and the prevalence of T2DM

Relationship between total cholesterol (TC) and T2DM
The sample was divided into three groups by triglyceride (TG) level: in the low group (TG < 1.70 mmol/L), the prevalence of T2DM was 6.0% (132/2209); in the intermediate group (1.70 mmol/L ≤ TG < 2.26 mmol/L), it was 9.5% (46/482); in the high group (TG ≥ 2.26 mmol/L), it was 17.9% (82/459). Comparing the three groups, in both men and women, the prevalence of T2DM rose as TG increased, and the difference was statistically significant (p < 0.001) (see Table 8).

The subjects were divided into two groups, a low exposure group (no smoking) and a high exposure group (smoking); there was no significant difference in prevalence between the two groups (χ² = 0.831, p > 0.05).
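As a rough check on the triglyceride figures, the prevalence in each group and a Pearson chi-square statistic can be recomputed from the reported counts. The sketch below assumes a plain chi-square test of independence on the pooled 3 × 2 table; the paper does not state which test produced p < 0.001, so this is an assumption and not the authors' procedure.

```python
import math

def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# (T2DM cases, group size) as reported in the text
groups = {"low": (132, 2209), "intermediate": (46, 482), "high": (82, 459)}
table = [[cases, size - cases] for cases, size in groups.values()]
prevalence = {name: cases / size for name, (cases, size) in groups.items()}

stat, dof = chi_square_independence(table)
# With exactly 2 degrees of freedom, the chi-square upper tail is exp(-x / 2).
p_value = math.exp(-stat / 2)

for name, value in prevalence.items():
    print(f"{name}: {value:.1%}")
print(f"chi-square = {stat:.2f}, df = {dof}, p < 0.001: {p_value < 0.001}")
```

The closed-form tail probability holds only for 2 degrees of freedom; a table of another shape would need a general chi-square distribution function.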

Dietary factors and T2DM
Comparing people who often eat coarse grains with those who do not, the OR for men was 0.41, with a 95% confidence interval of (0.25, 0.67); the difference in prevalence was statistically significant (χ² = 13.12, p < 0.001), and coarse grain is a protective factor. For women the OR was 0.73, with a 95% confidence interval of (0.47, 1.50); the difference in prevalence was statistically significant (χ² = 13.18, p < 0.05), and coarse grain is a protective factor.

Comparing people who often eat meat with those who do not, the OR for men was 2.37, with a 95% confidence interval of (1.46, 3.84); the difference in prevalence was statistically significant (χ² = 12.92, p < 0.001), and meat is a risk factor. For women the difference in prevalence was not statistically significant (χ² = 0.26, p > 0.05).

Comparing people who often eat vegetables with those who do not, the OR for men was 0.59, with a 95% confidence interval of (0.37, 0.93); the difference in prevalence was statistically significant (χ² = 5.16, p < 0.05), and vegetables are a protective factor. For women the OR was 0.50, with a 95% confidence interval of (0.32, 0.77); the difference in prevalence was statistically significant (χ² = 9.92, p < 0.05), and vegetables are a protective factor.
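For readers who want to see how an odds ratio and its 95% confidence interval are obtained from a 2 × 2 table, a minimal sketch follows. It uses the Woolf (log-based) interval, which is an assumption, since the paper does not name its interval method. The counts in the example are hypothetical, because the underlying 2 × 2 tables are not given in the text.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Woolf (log-based) confidence interval for a 2x2 table.

    a: exposed with T2DM      b: exposed without T2DM
    c: unexposed with T2DM    d: unexposed without T2DM
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

def classify(lower, upper):
    """Read an interval: below 1 is protective, above 1 is a risk factor."""
    if upper < 1:
        return "protective"
    if lower > 1:
        return "risk"
    return "not significant"

# Hypothetical counts, chosen only to illustrate the arithmetic.
or_value, ci_low, ci_high = odds_ratio_ci(a=20, b=80, c=40, d=60)
print(f"OR = {or_value:.3f}, 95% CI ({ci_low:.2f}, {ci_high:.2f})")

# Intervals reported in the text for men: coarse grain and meat.
print(classify(0.25, 0.67))  # protective
print(classify(1.46, 3.84))  # risk
```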

Analysis of logistic regression
The analysis of significant predictive variables showed that, for the male subjects, BMI, age, diastolic blood pressure, and smoking were risk factors, while vegetable intake, physical activity, and sleep were protective factors (β < 0); the regression equation of the risk prediction model was logit(P) = BMI × 0.735 + vegetables × (-0.671) + age × 0.838 + diastolic pressure × 0.296 + physical activity × (-2.287) + sleep × (-0.009) + smoking × 0.214. For the female subjects, BMI, age, and diastolic pressure were risk factors, while vegetable intake, physical activity, and sleep were protective factors; the regression equation of the risk prediction model was logit(P) = BMI × 1.979 + vegetables × (-0.292) + age × 1.355 + diastolic pressure × 0.522 + physical activity × (-2.287) + sleep × (-0.010) (see Tables 9 and 10).

Risk assessment tools for T2DM have gradually been developed and applied abroad [5][6][7][8][9], but because of differences in race, behavior and lifestyle, diet, medical and health services, and economic and cultural factors, foreign risk assessment tools are not fully applicable to Asian and/or Chinese populations [10][11]. Therefore, since 2006, scholars in China [12][13][14] have summarized the main risk factors for diabetes mellitus and their relative risks on the basis of epidemiological surveys of diabetes, and have established evaluation methods and models for individual diabetes risk in Chinese adults, using risk factor scoring, disease risk scores, and analysis of diabetes risk factors and prevalence data. These models were mostly based on case-control studies, and prospective studies were insufficient. Our study team has fully considered the conditions required for a robust risk model, including a stable and representative large sample, prospective observation, multivariate estimation, and a mature computer technology platform. However, the research is still at an initial stage, with many remaining problems. First, the study did not exclude the influence of collinearity among the predictive variables on the stability of the model. Second, the researchers evaluated the predictive model only in a similar population, and part of the data is incomplete. We will evaluate and validate this model by screening returning participants in the later part of the study.
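The two regression equations reported above can be evaluated mechanically. The sketch below takes its coefficients from the text, while the variable names, the coding of each predictor, and the example values are hypothetical: the paper specifies neither the coding scheme nor an intercept, so the resulting probability is illustrative only and is not a calibrated risk.

```python
import math

# Coefficients as reported in the regression equations.
MALE_COEFFICIENTS = {
    "bmi": 0.735, "vegetables": -0.671, "age": 0.838,
    "diastolic_pressure": 0.296, "physical_activity": -2.287,
    "sleep": -0.009, "smoking": 0.214,
}
FEMALE_COEFFICIENTS = {
    "bmi": 1.979, "vegetables": -0.292, "age": 1.355,
    "diastolic_pressure": 0.522, "physical_activity": -2.287,
    "sleep": -0.010,
}

def linear_predictor(coefficients, values):
    """logit(P): sum of coefficient times coded value (no intercept reported)."""
    return sum(coefficients[name] * values[name] for name in coefficients)

def predicted_probability(coefficients, values):
    """Convert logit(P) to P with the inverse logit."""
    return 1.0 / (1.0 + math.exp(-linear_predictor(coefficients, values)))

# Hypothetical coded values; the actual coding is not given in the paper.
example = {
    "bmi": 1, "vegetables": 1, "age": 1, "diastolic_pressure": 0,
    "physical_activity": 1, "sleep": 7, "smoking": 1,
}

male_logit = linear_predictor(MALE_COEFFICIENTS, example)
female_logit = linear_predictor(FEMALE_COEFFICIENTS, example)
print(f"male logit(P) = {male_logit:.3f}")
print(f"female logit(P) = {female_logit:.3f}")
print(f"male P = {predicted_probability(MALE_COEFFICIENTS, example):.3f}")
```

Negative coefficients lower logit(P), which is why vegetable intake, physical activity, and sleep act as protective factors in both equations.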

Figure 1. Sensitivity and specificity of the model.

2 Research design and program

2.1 Research objects
Researchers drew the sample from the 2009 T2DM baseline survey conducted for the project in 19 districts and counties of Ganzhou City. A total of 8086 records were included, from subjects aged 35-64 years. The baseline survey and follow-up data covered general demographic characteristics, smoking and drinking history, personal health history, family history of diabetes, physical activity and physical exercise, and so on. Body measurements included height, weight, BMI, waist circumference, hip circumference, waist-to-hip ratio (WHR), blood pressure, lung capacity, and so on. Biochemical and metabolic indexes included fasting blood glucose (FBG), total cholesterol (TC), triglyceride (TG), high density lipoprotein cholesterol (HDL-C), low density lipoprotein cholesterol (LDL-C), and blood uric acid (UA). All data were stored in an MSSQL database.

Table 1. T2DM prevalence rate by gender and age.

Table 2. T2DM prevalence rate by BMI and gender.

Table 3. Relationship between waist circumference and T2DM.

Table 7. Relationship between cholesterol and the prevalence rate of T2DM, regardless of gender.

Table 8. Relationship between triglyceride index and the prevalence rate of T2DM.

Table 9. Results of logistic regression analysis of male-related indexes.

Table 10. Results of logistic regression analysis of female-related indexes.

For men, the area under the ROC curve was 0.83, the sensitivity 0.72, and the specificity 0.86; for women, the area under the ROC curve was 0.84, the sensitivity 0.75, and the specificity 0.90. Both sets of results showed that the model had a high predictive value (see Figure 1).
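To make the three reported measures concrete, the sketch below computes sensitivity, specificity, and the area under the ROC curve from predicted scores. The labels, scores, and the 0.5 threshold are hypothetical toy values, not data from the study, and the AUC is computed by the pairwise-ranking definition, not by whatever software the authors used.

```python
def sensitivity_specificity(labels, scores, threshold):
    """Sensitivity and specificity when scores >= threshold are called positive."""
    pairs = list(zip(labels, scores))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)
    tn = sum(1 for y, s in pairs if y == 0 and s < threshold)
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

def roc_auc(labels, scores):
    """AUC as the chance that a random case outscores a random non-case.

    Ties between a case and a non-case count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

# Hypothetical test-group data: 1 = T2DM, 0 = no T2DM.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.35, 0.2, 0.1, 0.05]

sensitivity, specificity = sensitivity_specificity(labels, scores, threshold=0.5)
auc = roc_auc(labels, scores)
print(f"sensitivity = {sensitivity:.2f}, specificity = {specificity:.2f}, AUC = {auc:.2f}")
```

Sensitivity and specificity depend on the chosen threshold, whereas the AUC summarizes discrimination across all thresholds.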