Generative AI for Text-to-Image Synthesis: A Review of Current Techniques
Keywords:
Artificial Intelligence (AI), Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), Inception Score (IS), Fréchet Inception Distance (FID), CLIP-Guided Generation, Attention Mechanisms, Style Transfer
Abstract
Text-to-image synthesis is a relatively new domain of generative Artificial Intelligence (AI) that aims to produce coherent and visually meaningful images from natural language descriptions. This chapter surveys the techniques, models, and applications that form the backbone of this fast-developing area. We begin with the generative frameworks on which subsequent work builds, namely Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and Diffusion Models, each with its own strengths and weaknesses for translating text into images. Evaluation of realism, diversity, and semantic alignment is discussed through metrics such as the Inception Score (IS) and Fréchet Inception Distance (FID), complemented by user studies. The chapter also reviews key datasets for training and benchmarking, including MS-COCO, Oxford Pets, and CelebA. We then examine critical challenges such as semantic interpretation, image resolution, and output variability, together with recent advances including CLIP-guided generation, attention mechanisms, and style transfer. Applications in art, advertising, and gaming are presented, followed by ethical considerations such as bias, intellectual property rights, and potential misuse. Finally, we outline future directions toward more robust models, richer user interaction, and stronger cross-modal capabilities. This review is intended as a reference and starting point for researchers, developers, and practitioners involved in the design and implementation of generative AI for text-to-image synthesis.
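For context on the quantitative metrics named above, the Fréchet Inception Distance compares the statistics of Inception-network feature embeddings of real and generated images; a standard formulation (lower is better) is

\[
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right),
\]

where \(\mu_r, \Sigma_r\) and \(\mu_g, \Sigma_g\) denote the mean and covariance of the feature embeddings computed over the real and generated image sets, respectively.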
References
J. Ho, A. Jain, and P. Abbeel, “Denoising Diffusion Probabilistic Models,” in Proceedings of the 34th International Conference on Neural Information Processing Systems (NeurIPS), 2020. Available: https://dl.acm.org/doi/abs/10.5555/3495724.3496298
A. Radford et al., “Learning Transferable Visual Models from Natural Language Supervision,” arXiv:2103.00020, Feb. 2021. Available: https://arxiv.org/abs/2103.00020
A. Ramesh, M. Pavlov, G. Goh, S. Gray, et al., “Zero-Shot Text-to-Image Generation,” in Proceedings of the International Conference on Machine Learning (ICML), 2021. Available: https://arxiv.org/abs/2102.12092
R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-Resolution Image Synthesis with Latent Diffusion Models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. Available: https://arxiv.org/abs/2112.10752
H. Zhang, T. Xu, H. Li, et al., “StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks,” arXiv:1612.03242, 2016. Available: https://arxiv.org/abs/1612.03242
T. Xu, P. Zhang, Q. Huang, et al., “AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2018. doi: https://doi.org/10.1109/cvpr.2018.00143
A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hierarchical Text-Conditional Image Generation with CLIP Latents,” arXiv:2204.06125, Apr. 2022, Available: https://arxiv.org/abs/2204.06125
C. Saharia, W. Chan, S. Saxena, et al., “Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding,” in Proceedings of the 36th International Conference on Neural Information Processing Systems (NeurIPS), 2022. Available: https://dl.acm.org/doi/abs/10.5555/3600270.3602913
L. Yang, Z. Zhang, Y. Song, S. Hong, et al., “Diffusion Models: A Comprehensive Survey of Methods and Applications,” ACM Computing Surveys, vol. 56, no. 4, pp. 1–39, Nov. 2023. doi: https://doi.org/10.1145/3626235
S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, “Generative Adversarial Text to Image Synthesis,” arXiv:1605.05396, 2016. Available: https://arxiv.org/abs/1605.05396
M. Zhu, P. Pan, W. Chen, and Y. Yang, “DM-GAN: Dynamic Memory Generative Adversarial Networks for Text-To-Image Synthesis,” 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2019, doi: https://doi.org/10.1109/cvpr.2019.00595.
M. Tao, H. Tang, F. Wu, X. Jing, B.-K. Bao, and C. Xu, “DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis,” 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2022, doi: https://doi.org/10.1109/cvpr52688.2022.01602.
O. Patashnik, Z. Wu, E. Shechtman, D. Cohen-Or, and D. Lischinski, “StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery,” 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Oct. 2021, doi: https://doi.org/10.1109/iccv48922.2021.00209.
A. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, et al., “GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models,” arXiv:2112.10741, Mar. 2022. Available: https://arxiv.org/abs/2112.10741
R. Gal, Y. Alaluf, Y. Atzmon, et al., “An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion,” arXiv:2208.01618, Aug. 2022. Available: https://arxiv.org/abs/2208.01618