Mitigating Data Scarcity in Cancer Classification with Synthetic Data

Madni H. A.; Zottin S.; De Nardin A.; Foresti G. L.
2026-01-01

Abstract

Clinical datasets are often limited in availability and subject to strict privacy regulations, which makes it difficult to develop accurate classification models. Synthetic data generation offers a promising alternative, enabling the augmentation of training datasets while preserving patient privacy. However, generating high-fidelity synthetic images that effectively support model training remains challenging. In this paper, we focus on the classification of colorectal and lung carcinoma as representative tasks in clinical cancer diagnostics. We use a Stable Diffusion model enhanced with Low-Rank Adaptation (LoRA) weights, fine-tuned on a limited number of real images, to generate synthetic images. This method improves the performance of the DeiT-L (Data-efficient Image Transformer-Large) and CLIP (Contrastive Language-Image Pretraining) models on the colon and lung datasets, respectively, when they are trained on a large amount of synthetic data and a few real samples. The synthetic data closely mirror the real samples, mitigating the issues of data scarcity and privacy while enhancing model generalization. These findings support the use of synthetic data as a viable tool in cancer classification and demonstrate its potential to strengthen deep learning applications in clinical diagnostics and medical research. Experimental results on colon and lung histopathology datasets demonstrate that augmenting limited real data with diffusion-generated synthetic images consistently improves classification accuracy and generalization across multiple deep learning architectures, particularly in low-data (few-shot) regimes. Notably, the hybrid configuration, which combines a small number of real samples with synthetic data, yields the most practically significant performance gains, highlighting its relevance for real-world clinical settings where annotated data are scarce.
Code is available at: https://github.com/h-ahmad/rare_disease_classification
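
As a rough illustration of the hybrid configuration described in the abstract, the sketch below assembles a training list from a few real samples per class plus all available synthetic samples. It is not taken from the linked repository: the function name `build_hybrid_split`, the `k_real` parameter, and the class labels and file names in the usage note are hypothetical and chosen only for illustration.

```python
import random


def build_hybrid_split(real_by_class, synthetic_by_class, k_real, seed=0):
    """Return a shuffled list of (path, label, source) training samples.

    Hypothetical helper: keeps at most k_real real images per class
    (the few-shot part) and adds every synthetic image for that class.
    """
    rng = random.Random(seed)
    samples = []
    for label in sorted(real_by_class):
        real = list(real_by_class[label])
        # Few-shot regime: sample at most k_real real images for this class.
        chosen = rng.sample(real, min(k_real, len(real)))
        samples += [(path, label, "real") for path in chosen]
        # Add all synthetic images generated for the same class, if any.
        synthetic = synthetic_by_class.get(label, [])
        samples += [(path, label, "synthetic") for path in synthetic]
    rng.shuffle(samples)
    return samples
```

For example, calling `build_hybrid_split({"benign": ["b1.png", "b2.png", "b3.png"]}, {"benign": ["s1.png", "s2.png"]}, k_real=2)` would return four entries, two tagged `"real"` and two tagged `"synthetic"`. The `source` tag is kept so that the ratio of real to synthetic data can be inspected or varied later.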
Files in this item:
Mitigating_Data_Scarcity_in_Cancer_Classification_With_Synthetic_Data.pdf

Access: open access
Type: Publisher's version (PDF)
License: Creative Commons
Size: 2.79 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11390/1329728