Mitigating Data Scarcity in Cancer Classification with Synthetic Data

Madni, H. A.; Shujat, H.; Zottin, S.; De Nardin, A.; Foresti, G. L.

doi:10.1109/ACCESS.2026.3678764

Clinical datasets are often limited in availability and subject to strict privacy regulations, which poses significant challenges for the development of accurate classification models. Synthetic data generation offers a promising alternative, enabling the augmentation of training datasets while preserving patient privacy. However, generating high-fidelity synthetic images that effectively support model training remains difficult. In this paper, we focus on the classification of colorectal and lung carcinoma as representative tasks in clinical cancer diagnostics. We use a Stable Diffusion model enhanced with Low-Rank Adaptation (LoRA) weights, fine-tuned on a limited number of real images, to generate synthetic images. This method improves the performance of the DeiT-L (Data-efficient Image Transformer-Large) and CLIP (Contrastive Language-Image Pretraining) models on the colon and lung datasets, respectively, when they are trained on abundant synthetic data together with a few real samples. The synthetic data closely mirrors the real samples, mitigating the issues of data scarcity and privacy while enhancing model generalization. These findings support the use of synthetic data as a viable tool in cancer classification and demonstrate its potential to strengthen deep learning applications in clinical diagnostics and medical research. Experimental results on colon and lung histopathology datasets demonstrate that augmenting limited real data with diffusion-generated synthetic images consistently improves classification accuracy and generalization across multiple deep learning architectures, particularly in low-data (few-shot) regimes. Notably, the hybrid configuration, which combines a small number of real samples with synthetic data, yields the most practically significant performance gains, highlighting its relevance for real-world clinical settings where annotated data are scarce.
Code is available at: https://github.com/h-ahmad/rare_disease_classification
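
As a rough illustration of the hybrid configuration described in the abstract (a few real samples per class combined with a larger pool of synthetic images), the following minimal Python sketch assembles such a training list. It is not taken from the linked repository: the function name `build_hybrid_split`, the parameters `shots` and `synthetic_per_class`, the class labels, and the file names are all hypothetical and chosen only for illustration.

```python
import random
from typing import Dict, List, Tuple

# (image path, class label, source), where source is "real" or "synthetic".
Sample = Tuple[str, str, str]


def build_hybrid_split(
    real: Dict[str, List[str]],
    synthetic: Dict[str, List[str]],
    shots: int,
    synthetic_per_class: int,
    seed: int = 0,
) -> List[Sample]:
    """Mix a few real images per class with synthetic ones (hypothetical helper).

    `shots` is the number of real images kept per class (the few-shot budget);
    `synthetic_per_class` is the number of generated images added per class.
    Both are capped by the size of the available pools.
    """
    rng = random.Random(seed)  # local RNG so the split is reproducible
    samples: List[Sample] = []
    for label in sorted(real):
        real_pool = sorted(real[label])
        synth_pool = sorted(synthetic.get(label, []))
        chosen_real = rng.sample(real_pool, min(shots, len(real_pool)))
        chosen_synth = rng.sample(
            synth_pool, min(synthetic_per_class, len(synth_pool))
        )
        samples += [(path, label, "real") for path in chosen_real]
        samples += [(path, label, "synthetic") for path in chosen_synth]
    rng.shuffle(samples)  # interleave real and synthetic items
    return samples


# Usage with placeholder file names (assumed, not from the paper's datasets).
real_images = {
    "class_a": [f"real/class_a/{i}.png" for i in range(10)],
    "class_b": [f"real/class_b/{i}.png" for i in range(10)],
}
synthetic_images = {
    "class_a": [f"synthetic/class_a/{i}.png" for i in range(50)],
    "class_b": [f"synthetic/class_b/{i}.png" for i in range(50)],
}
train_list = build_hybrid_split(
    real_images, synthetic_images, shots=4, synthetic_per_class=20, seed=1
)
```

Keeping the source tag on every item makes it straightforward to evaluate on real images only, or to compare real-only, synthetic-only, and hybrid training under the same few-shot budget.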

2026-01-01
