Abstract
Vision-Language Foundation Models (VLFMs) have shown tremendous gains in generating high-resolution, photorealistic natural images. While VLFMs exhibit a rich understanding of semantic content across modalities, they often struggle with fine-grained alignment tasks that require precise correspondence between image regions and textual descriptions, a limitation in medical imaging, where accurate localization and detection of clinical features are essential for diagnosis and analysis. To address this issue, we propose a multi-stage architecture in which a pre-trained VLFM (e.g., Stable Diffusion) provides a coarse semantic understanding, while a reinforcement learning (RL) algorithm refines the alignment through an iterative process that optimizes for semantic context. The reward signal is designed to align the semantic information of the text with the synthesized images. Experiments on the public ISIC2019 skin lesion dataset demonstrate that the proposed method improves (a) the quality of the generated images and (b) their alignment with the text prompt over the original fine-tuned Stable Diffusion baseline. We also show that the synthesized samples can be used to improve disease classifier performance for underrepresented subgroups through data augmentation.
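To make the RL refinement stage concrete, the sketch below illustrates a REINFORCE-style, reward-weighted fine-tuning loop in the spirit of DDPO: a denoising trajectory is sampled, a scalar reward scores the alignment of the final sample with its conditioning, and the per-step log-probabilities are weighted by that reward. The `ToyDenoiser` network, the `alignment_reward` function, and all hyperparameters are illustrative stand-ins rather than the components or settings used in the paper.

```python
# Minimal REINFORCE-style sketch of reward-weighted fine-tuning of a denoiser,
# in the spirit of DDPO. The denoiser, reward, and data are toy stand-ins.
import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    """Stand-in for a conditional denoising network (e.g. a U-Net)."""
    def __init__(self, dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim * 2, 64), nn.ReLU(), nn.Linear(64, dim))

    def forward(self, x_t, cond):
        # Predict the mean of the next (less noisy) latent given the prompt embedding.
        return self.net(torch.cat([x_t, cond], dim=-1))

def alignment_reward(x0, cond):
    """Stand-in reward: cosine similarity between the final sample and the prompt
    embedding. In the paper's setting this would score text-image alignment of the
    decoded image against the medical text prompt."""
    return torch.cosine_similarity(x0, cond, dim=-1)

denoiser = ToyDenoiser()
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-4)
T, sigma, dim, batch = 10, 0.1, 16, 8  # illustrative settings

for it in range(100):
    cond = torch.randn(batch, dim)           # toy "prompt embeddings"
    x = torch.randn(batch, dim)               # start the trajectory from noise
    log_probs = []
    for t in range(T):                         # sample a denoising trajectory
        mean = denoiser(x, cond)
        dist = torch.distributions.Normal(mean, sigma)
        x = dist.sample()                      # treated as data (no gradient through x)
        log_probs.append(dist.log_prob(x).sum(-1))
    reward = alignment_reward(x, cond)         # reward on the final sample only
    advantage = (reward - reward.mean()) / (reward.std() + 1e-8)
    loss = -(advantage.detach() * torch.stack(log_probs).sum(0)).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```

In the actual method, the denoiser would be the fine-tuned Stable Diffusion network and the reward would be computed on decoded images against the input text prompt.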
Method
Key Contributions
- We introduce RL4Med-DDPO, the first framework to integrate policy optimization with vision-language foundation models for improved text-guided medical image synthesis using Stable Diffusion.
- We show that reinforcement learning (via DDPO) enhances the alignment between medical text prompts and the generated images, addressing common issues of semantic mismatch and spurious artifacts.
- We propose a new evaluation metric, Artifact Prevalence Rate (APR), to quantify the presence of desired attributes and artifacts in synthesized images; a minimal computation sketch is given after this list.
- Extensive experiments on the public ISIC2019 skin lesion dataset demonstrate that our method generates photorealistic images with improved text-image alignment, supporting more robust and bias-aware medical image synthesis.
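The APR contribution above does not spell out the metric's exact formula on this page, so the following is a hedged sketch of one natural reading: the fraction of synthesized images in which a given attribute or artifact is detected. The `has_attribute` detector (e.g., a thresholded classifier for rulers or other acquisition artifacts) is a hypothetical placeholder; the definition used in the paper may differ.

```python
# Hedged sketch of an Artifact Prevalence Rate (APR)-style metric: the fraction of
# synthesized images in which a given attribute/artifact detector fires.
from typing import Callable, Iterable

def artifact_prevalence_rate(images: Iterable, has_attribute: Callable[[object], bool]) -> float:
    """Return the fraction of images for which the attribute/artifact detector fires."""
    images = list(images)
    if not images:
        return 0.0
    hits = sum(1 for img in images if has_attribute(img))
    return hits / len(images)

# Example usage with a hypothetical detector thresholded at 0.5:
# apr = artifact_prevalence_rate(generated_images, lambda img: ruler_detector(img) > 0.5)
```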

Results


Conclusion
In this work, we present the first method to demonstrate alignment between the text prompt and the generated image using a vision-language foundation model guided by policy optimization for medical imaging applications. Through extensive qualitative and quantitative validation, we show that the generated images align well with the input text prompt and are useful for downstream tasks such as augmenting classifier training data to improve performance on minority classes. Future work will explore the use of diverse policies for more complex tasks, such as subgroup clustering in the latent space for disease image marker discovery.
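As a usage illustration of the augmentation result mentioned above, the sketch below mixes synthetic minority-subgroup images into the real training set before fitting a classifier. The datasets, tensors, and model here are toy stand-ins, not the ISIC2019 pipeline used in the experiments.

```python
# Toy sketch: augment an imbalanced training set with synthetic minority-class
# samples, then train a small classifier on the combined data.
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Stand-ins for the real dataset and the synthesized minority-subgroup images.
real_ds = TensorDataset(torch.randn(256, 3, 64, 64), torch.randint(0, 2, (256,)))
synthetic_minority_ds = TensorDataset(torch.randn(64, 3, 64, 64), torch.ones(64, dtype=torch.long))

train_loader = DataLoader(ConcatDataset([real_ds, synthetic_minority_ds]), batch_size=32, shuffle=True)

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()

for x, y in train_loader:          # one pass over the augmented training set
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()
```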
BibTeX
@article{saremi2025rl4med,
  title   = {{RL4Med-DDPO}: Reinforcement Learning for Controlled Guidance Towards Diverse Medical Image Generation using Vision-Language Foundation Models},
  author  = {Saremi, Parham and Kumar, Amar and Mohammed, Mohammed and TehraniNasab, Zahra and Arbel, Tal},
  journal = {arXiv preprint arXiv:2503.15784},
  year    = {2025}
}