Multimodal Text-to-Traditional Painting Generation via Enhanced VQGAN with CLIP and Transformer Integration
With the advance of artificial intelligence, text-driven image generation has become a research hotspot. For traditional Chinese painting in particular, the genre's distinctive themes, brush techniques, and artistic conception expose weaknesses in existing models, such as cross-modal alignment deviation. To strengthen the model's understanding of text semantics and improve the match between text and generated Chinese painting images, this study builds a multimodal dataset of Chinese paintings and improves the Vector Quantized Generative Adversarial Network (VQGAN). A new multimodal text-to-image generation method is constructed by combining Bidirectional Encoder Representations from Transformers (BERT), a Transformer decoder, and the Contrastive Language-Image Pre-training (CLIP) model. After 100 training epochs on the Common Objects in Context (COCO) and Flickr30k datasets, the method achieved Inception Score (IS) values of 39.6 and 5.1 and Fréchet Inception Distance (FID) values of 2.3 and 1.7, respectively, outperforming models such as VQGAN. In the ablation experiment, the method reached an IS of 5.8, an FID of 10.3, and a text-image cosine similarity of 0.82, and it converged after 60 epochs, surpassing the other variants. The proposed method adapts well to the generation of traditional Chinese painting; although it generalizes somewhat weakly to complex scenes, its overall performance is strong, filling a gap in the semantic mapping of traditional art and supporting the digital dissemination and innovation of traditional Chinese painting.
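The abstract names the building blocks but not how they connect. The sketch below is a minimal, hypothetical PyTorch rendering of such a pipeline, not the authors' implementation: BERT encodes the caption, a Transformer decoder conditioned on that encoding autoregressively predicts VQGAN codebook indices, and CLIP computes the text-image cosine similarity used as an alignment score (the form of the 0.82 figure reported in the ablation). The codebook size, latent grid size, layer counts, and all class and function names are illustrative assumptions.

```python
# Minimal sketch, not the authors' released code: all identifiers and sizes
# (16x16 latent grid, 1024-entry codebook, 6 decoder layers) are assumptions.
import torch
import torch.nn as nn
from transformers import BertModel, CLIPModel, CLIPProcessor

CODEBOOK_SIZE = 1024  # assumed VQGAN codebook size
GRID = 16             # assumed 16x16 latent grid -> 256 code tokens per image
D_MODEL = 512

class TextToPaintingSketch(nn.Module):
    """Hypothetical BERT -> Transformer decoder -> VQGAN-code pipeline."""

    def __init__(self):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-chinese")
        self.text_proj = nn.Linear(self.bert.config.hidden_size, D_MODEL)
        self.code_emb = nn.Embedding(CODEBOOK_SIZE + 1, D_MODEL)  # +1 for BOS
        self.pos_emb = nn.Embedding(GRID * GRID + 1, D_MODEL)
        layer = nn.TransformerDecoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.head = nn.Linear(D_MODEL, CODEBOOK_SIZE)

    @torch.no_grad()
    def sample_codes(self, input_ids, attention_mask, temperature=1.0):
        """Autoregressively sample a GRID x GRID map of codebook indices."""
        text = self.bert(input_ids, attention_mask=attention_mask)
        memory = self.text_proj(text.last_hidden_state)  # cross-attention context
        # Start from a BOS token (index CODEBOOK_SIZE, outside the codebook).
        codes = torch.full((input_ids.size(0), 1), CODEBOOK_SIZE, dtype=torch.long)
        for _ in range(GRID * GRID):
            pos = torch.arange(codes.size(1))
            x = self.code_emb(codes) + self.pos_emb(pos)
            mask = nn.Transformer.generate_square_subsequent_mask(codes.size(1))
            h = self.decoder(x, memory, tgt_mask=mask)
            logits = self.head(h[:, -1]) / temperature
            nxt = torch.multinomial(torch.softmax(logits, dim=-1), 1)
            codes = torch.cat([codes, nxt], dim=1)
        # Mapping these indices back to pixels requires the matching pretrained
        # VQGAN decoder (omitted here).
        return codes[:, 1:].view(-1, GRID, GRID)

def clip_cosine_similarity(caption, images):
    """Score candidate images against the caption with CLIP embeddings."""
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    inputs = proc(text=[caption], images=images, return_tensors="pt", padding=True)
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    return torch.cosine_similarity(txt, img)  # one score per candidate image
```

In training, such a decoder would typically be fit with teacher forcing against codebook indices produced by a pretrained VQGAN encoder on real paintings, and the CLIP score could re-rank several sampled candidates per caption; both steps are sketched under those assumptions rather than taken from the paper.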