| Issue | BIO Web Conf., Volume 228, 2026. Biospectrum 2025: International Conference on Biotechnology and Biological Science |
|---|---|
| Article Number | 06002 |
| Number of page(s) | 7 |
| Section | Medical Biotechnology II |
| DOI | https://doi.org/10.1051/bioconf/202622806002 |
| Published online | 11 March 2026 |

Exploring Haematological Biomarkers through Data Mining for Multi-Disease Prediction
1 Assistant Professor, Institute of Engineering & Management, University of Engineering & Management, Kolkata, India
2 Associate Professor, Institute of Engineering & Management, University of Engineering & Management, Kolkata, India
3 Research Scholar, Institute of Engineering & Management, University of Engineering & Management, Kolkata, India
Abstract
Background: Complete blood count (CBC) parameters are non-specific disease indicators, limiting their diagnostic utility when used individually.
Objective: To develop and validate a machine learning model combining multiple CBC biomarkers (TLC, PCV, PLT, RDW, HGB) for early screening of patients with abnormal blood profiles suggestive of autoimmune disorders and/or malignancies.
Methods: CBC data were analysed from a Kaggle dataset comprising 364 patients. Binary outlier indicators were derived from NIH reference ranges. A logistic regression model was trained (n = 292, 80%) and validated (n = 72, 20%) using repeated 10-fold cross-validation. Multicollinearity was assessed with the variance inflation factor (VIF), with VIF < 5 taken as acceptable.
Results: The model achieved an AUC of 0.886, with a 10-fold cross-validation accuracy of 78%, sensitivity of 88.6%, specificity of 50%, and precision of 82.4%. Four predictors were significantly associated with the outcome: TLC (p < 0.001), PCV (p < 0.001), PLT (p = 0.003), and RDW (p = 0.008); HGB was not (p = 0.142). The model detected 72.5% of patients with at least one CBC abnormality for clinical follow-up, and 3.02% with multiple concurrent abnormalities. Gender differences were observed (male: 35.4% positive; female: 17% positive).
Conclusion: This proof-of-concept demonstrates that logistic regression modelling of CBC outliers can identify high-risk patients for further diagnostic workup. However, the low specificity (50%) and lack of confirmed diagnoses limit clinical applicability. External validation on larger, multi-centre datasets with verified disease outcomes is required before clinical implementation.
Key words: Complete Blood Count / Machine Learning / Logistic Regression / Cancer Screening / Autoimmune Disease / Liquid Biopsy / Early Detection
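
The binary outlier indicators described in the Methods can be sketched as follows. This is a minimal illustration, not the authors' code: the reference intervals are placeholder values chosen for the example (the abstract does not list the NIH ranges that were used), and the names `REFERENCE_RANGES`, `outlier_flags`, and `abnormality_count` are hypothetical.

```python
# Illustrative reference intervals as (low, high). These are placeholder
# values for the sketch, NOT the NIH ranges used in the study.
REFERENCE_RANGES = {
    "TLC": (4.0, 11.0),     # total leukocyte count, 10^9/L
    "PCV": (36.0, 50.0),    # packed cell volume, %
    "PLT": (150.0, 450.0),  # platelet count, 10^9/L
    "RDW": (11.5, 14.5),    # red cell distribution width, %
    "HGB": (12.0, 17.5),    # haemoglobin, g/dL
}


def outlier_flags(record, ranges=REFERENCE_RANGES):
    """Return 1 for each marker outside its reference interval, else 0."""
    flags = {}
    for marker, (low, high) in ranges.items():
        value = record[marker]
        flags[marker] = int(value < low or value > high)
    return flags


def abnormality_count(record, ranges=REFERENCE_RANGES):
    """Count how many markers are flagged as outliers for one patient."""
    return sum(outlier_flags(record, ranges).values())


# Hypothetical patient record, for illustration only.
patient = {"TLC": 13.2, "PCV": 41.0, "PLT": 120.0, "RDW": 13.0, "HGB": 13.5}
flags = outlier_flags(patient)
n_abnormal = abnormality_count(patient)
```

Under this sketch, a count of at least one would correspond to a patient with "at least one CBC abnormality", and a count of two or more to "multiple concurrent abnormalities"; how the study itself defined these groups is not stated in the abstract.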
© The Authors, published by EDP Sciences, 2026
This is an Open Access article distributed under the terms of the Creative Commons Attribution License 4.0, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

