Improving Text-Driven Image Synthesis: Diffusion Models for Photorealistic Outcomes

Authors

  • T.M.N. Vamsi
  • J. N.V.R. Swarup Kumar
  • I.S. Siva Rao
  • Pratibha Lanka

DOI:

https://doi.org/10.47839/ijc.23.4.3766

Keywords:

Text-Conditional, Diffusion Models, Photorealistic Images, CLIP, Classifier-Free Guidance, GANs (Generative Adversarial Networks), Transformer-Based Neural Network, Latent Code Encoding, Inpainting, Perceptual Loss

Abstract

Diffusion models have recently proven effective at generating high-quality images, particularly when combined with a guidance technique that trades off image diversity against fidelity. To address text-conditional image synthesis, we study diffusion models with two guidance strategies: CLIP (Contrastive Language–Image Pretraining) guidance and classifier-free guidance. Throughout our analysis, classifier-free guidance consistently stands out, producing images with strong photorealism. This method achieved a PSNR of 183.66 dB and an SSIM of 99.99%, indicating high photorealism and structural similarity to the ground-truth images. The work combines diffusion models with classifier-free guidance for text-conditional image synthesis, focusing on photorealism and alignment with captions. The approach may therefore help produce images that human evaluators judge to be both visually realistic and faithful to their captions.
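
The abstract names classifier-free guidance and reports PSNR without giving formulas, so the following is a minimal Python sketch of both under stated assumptions, not the authors' implementation. It uses the commonly cited guidance rule from the classifier-free guidance literature, in which the guided noise prediction is the unconditional prediction pushed toward the text-conditional one by a guidance scale, and the standard PSNR definition, 10 * log10(MAX^2 / MSE). The function names, the flat-list representation of predictions and pixels, and the default peak value of 255 are illustrative assumptions.

```python
import math


def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """Blend unconditional and text-conditional noise predictions.

    Assumed rule: eps = eps_uncond + scale * (eps_cond - eps_uncond).
    scale = 0 returns the unconditional prediction, scale = 1 the
    conditional one; larger values favor fidelity over diversity.
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two flat pixel lists."""
    if not reference or len(reference) != len(generated):
        raise ValueError("images must be non-empty and the same size")
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    # Toy values only; a real sampler would apply the rule at every denoising step.
    print(classifier_free_guidance([0.0, 1.0], [1.0, 3.0], scale=3.0))
    print(psnr([0, 0, 0, 0], [10, 10, 10, 10]))
```

In this sketch, a scale above 1 extrapolates past the conditional prediction, which is the mechanism behind the diversity and fidelity trade-off the abstract describes.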

Published

2025-01-12

How to Cite

Vamsi, T., Swarup Kumar, J. N., Siva Rao, I., & Lanka, P. (2025). Improving Text-Driven Image Synthesis: Diffusion Models for Photorealistic Outcomes. International Journal of Computing, 23(4), 673-680. https://doi.org/10.47839/ijc.23.4.3766

Section

Articles