Abstract
Vision-Language Foundation Models (VLFMs) have shown tremendous gains in generating high-resolution, photorealistic natural images. While VLFMs exhibit a rich understanding of semantic content across modalities, they often struggle with fine-grained alignment tasks that require precise correspondence between image regions and textual descriptions, a limitation in medical imaging, where accurate localization and detection of clinical features are essential for diagnosis and analysis. To address this issue, we propose a multi-stage architecture in which a pre-trained VLFM (e.g., Stable Diffusion) provides a coarse semantic understanding, while a reinforcement learning (RL) algorithm refines the alignment through an iterative process that optimizes for semantic context. The reward signal is designed to align the semantic information of the text with the synthesized images. Experiments on the public ISIC2019 skin lesion dataset demonstrate that the proposed method improves (a) the quality of the generated images and (b) their alignment with the text prompt over the original fine-tuned Stable Diffusion baseline. We also show that the synthesized samples can be used to improve disease classifier performance for underrepresented subgroups through data augmentation.
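To make the RL refinement stage concrete, the sketch below illustrates a REINFORCE-style, reward-weighted fine-tuning loop in the spirit of DDPO: a denoising trajectory is sampled, a scalar reward scores the alignment of the final sample with its conditioning, and the per-step log-probabilities are weighted by that reward. The `ToyDenoiser` network, the `alignment_reward` function, and all hyperparameters are illustrative stand-ins rather than the components or settings used in the paper.

```python
# Minimal REINFORCE-style sketch of reward-weighted fine-tuning of a denoiser,
# in the spirit of DDPO. The denoiser, reward, and data are toy stand-ins.
import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    """Stand-in for a conditional denoising network (e.g. a U-Net)."""
    def __init__(self, dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim * 2, 64), nn.ReLU(), nn.Linear(64, dim))

    def forward(self, x_t, cond):
        # Predict the mean of the next (less noisy) latent given the prompt embedding.
        return self.net(torch.cat([x_t, cond], dim=-1))

def alignment_reward(x0, cond):
    """Stand-in reward: cosine similarity between the final sample and the prompt
    embedding. In the paper's setting this would score text-image alignment of the
    decoded image against the medical text prompt."""
    return torch.cosine_similarity(x0, cond, dim=-1)

denoiser = ToyDenoiser()
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-4)
T, sigma, dim, batch = 10, 0.1, 16, 8  # illustrative settings

for it in range(100):
    cond = torch.randn(batch, dim)           # toy "prompt embeddings"
    x = torch.randn(batch, dim)               # start the trajectory from noise
    log_probs = []
    for t in range(T):                         # sample a denoising trajectory
        mean = denoiser(x, cond)
        dist = torch.distributions.Normal(mean, sigma)
        x = dist.sample()                      # treated as data (no gradient through x)
        log_probs.append(dist.log_prob(x).sum(-1))
    reward = alignment_reward(x, cond)         # reward on the final sample only
    advantage = (reward - reward.mean()) / (reward.std() + 1e-8)
    loss = -(advantage.detach() * torch.stack(log_probs).sum(0)).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```

In the actual method, the denoiser would be the fine-tuned Stable Diffusion network and the reward would be computed on decoded images against the input text prompt.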
Method
Key Contributions
- We introduce RL4Med-DDPO, the first framework to integrate policy optimization with vision-language foundation models for improved text-guided medical image synthesis using Stable Diffusion.
- We show that reinforcement learning (via DDPO) enhances the alignment between medical text prompts and the generated images, addressing common issues of semantic mismatch and spurious artifacts.
- We propose a new evaluation metric, Artifact Prevalence Rate (APR), to quantify the presence of desired attributes and artifacts in synthesized images; a minimal computation sketch is given after this list.
- Extensive experiments on the public ISIC2019 skin lesion dataset demonstrate that our method generates photorealistic images with improved text-image alignment, supporting more robust and bias-aware medical image synthesis.
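The APR contribution above does not spell out the metric's exact formula on this page, so the following is a hedged sketch of one natural reading: the fraction of synthesized images in which a given attribute or artifact is detected. The `has_attribute` detector (e.g., a thresholded classifier for rulers or other acquisition artifacts) is a hypothetical placeholder; the definition used in the paper may differ.

```python
# Hedged sketch of an Artifact Prevalence Rate (APR)-style metric: the fraction of
# synthesized images in which a given attribute/artifact detector fires.
from typing import Callable, Iterable

def artifact_prevalence_rate(images: Iterable, has_attribute: Callable[[object], bool]) -> float:
    """Return the fraction of images for which the attribute/artifact detector fires."""
    images = list(images)
    if not images:
        return 0.0
    hits = sum(1 for img in images if has_attribute(img))
    return hits / len(images)

# Example usage with a hypothetical detector thresholded at 0.5:
# apr = artifact_prevalence_rate(generated_images, lambda img: ruler_detector(img) > 0.5)
```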

Results


Conclusion
In this work, we present the first method to demonstrate alignment between the text prompt and the generated image using a vision-language foundation model guided by policy optimization for medical imaging applications. Through extensive qualitative and quantitative validation, we show that the generated images align well with the input text prompt and are useful for downstream tasks such as augmenting classifier training data to improve performance on minority classes. Future work will explore the use of diverse policies for more complex tasks, such as subgroup clustering in the latent space for disease image marker discovery.
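As a usage illustration of the augmentation result mentioned above, the sketch below mixes synthetic minority-subgroup images into the real training set before fitting a classifier. The datasets, tensors, and model here are toy stand-ins, not the ISIC2019 pipeline used in the experiments.

```python
# Toy sketch: augment an imbalanced training set with synthetic minority-class
# samples, then train a small classifier on the combined data.
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Stand-ins for the real dataset and the synthesized minority-subgroup images.
real_ds = TensorDataset(torch.randn(256, 3, 64, 64), torch.randint(0, 2, (256,)))
synthetic_minority_ds = TensorDataset(torch.randn(64, 3, 64, 64), torch.ones(64, dtype=torch.long))

train_loader = DataLoader(ConcatDataset([real_ds, synthetic_minority_ds]), batch_size=32, shuffle=True)

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()

for x, y in train_loader:          # one pass over the augmented training set
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()
```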
BibTeX
@article{saremi2025rl4med,
  title   = {{RL4Med-DDPO}: Reinforcement Learning for Controlled Guidance Towards Diverse Medical Image Generation using Vision-Language Foundation Models},
  author  = {Saremi, Parham and Kumar, Amar and Mohammed, Mohammed and TehraniNasab, Zahra and Arbel, Tal},
  journal = {arXiv preprint arXiv:2503.15784},
  year    = {2025}
}