Multimodal Text-to-Traditional Painting Generation via Enhanced VQGAN with CLIP and Transformer Integration

Keywords: VQGAN; Contrastive Language-Image Pre-Training Models; Multimodal; Text-Driven Chinese Painting Image Generation; Transformer

With the development of artificial intelligence technology, text-driven image generation has become a research hotspot. When applied to traditional Chinese painting, however, existing models suffer from problems such as cross-modal alignment deviation, owing to the distinctive themes, techniques, and artistic conception of the genre. To enhance the model's understanding of text semantics and improve the matching between text and generated Chinese painting images, this study builds a multimodal dataset of Chinese paintings and improves the Vector Quantized Generative Adversarial Network (VQGAN). A new multimodal text-to-image generation method is constructed by combining Bidirectional Encoder Representations from Transformers (BERT), a Transformer decoder, and Contrastive Language-Image Pre-training (CLIP) models. The results showed that when trained for 100 epochs on the Common Objects in Context (COCO) dataset and the Flickr30k dataset, this method achieved Inception Score (IS) values of 39.6 and 5.1 and Fréchet Inception Distance (FID) values of 2.3 and 1.7, respectively, outperforming baseline models such as VQGAN. In the ablation experiment, the full model achieved an IS of 5.8, an FID of 10.3, and a text-image cosine similarity of 0.82, converging after 60 epochs and outperforming the other variants. The proposed method adapts well to traditional Chinese painting generation. Although its generalization to complex scenes remains somewhat weak, overall performance is strong, filling a gap in the semantic mapping of traditional art and supporting the digital dissemination and innovation of traditional Chinese painting.
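For concreteness, the following is a minimal sketch (not the authors' released code) of the kind of pipeline the abstract describes: a BERT text encoder conditions a Transformer decoder that autoregressively predicts VQGAN codebook indices, and CLIP scores the text-image cosine similarity reported in the ablation. The VQGAN decoder call, codebook size, and latent grid size are assumed placeholders; a trained codebook and a causal training objective would be needed in practice.

```python
# Minimal sketch, assuming a pretrained VQGAN decoder exists elsewhere
# (the `vqgan.decode` call and the codebook/grid sizes are placeholders).
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer, CLIPModel, CLIPProcessor

CODEBOOK_SIZE = 1024   # assumed VQGAN codebook size
NUM_TOKENS = 256       # assumed 16x16 latent token grid

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text_encoder = BertModel.from_pretrained("bert-base-uncased")

# Transformer decoder that cross-attends to BERT text features and
# autoregressively predicts discrete VQGAN codebook indices.
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=768, nhead=8, batch_first=True),
    num_layers=6,
)
token_embed = nn.Embedding(CODEBOOK_SIZE + 1, 768)  # +1 for a <start> token
to_logits = nn.Linear(768, CODEBOOK_SIZE)

@torch.no_grad()
def generate_indices(prompt: str) -> torch.Tensor:
    """Greedily decode a sequence of codebook indices from a text prompt."""
    enc = tokenizer(prompt, return_tensors="pt")
    memory = text_encoder(**enc).last_hidden_state      # text conditioning
    seq = torch.tensor([[CODEBOOK_SIZE]])               # <start> token
    for _ in range(NUM_TOKENS):
        h = decoder(token_embed(seq), memory)
        next_idx = to_logits(h[:, -1]).argmax(-1, keepdim=True)
        seq = torch.cat([seq, next_idx], dim=1)
    return seq[:, 1:]   # drop <start>; a VQGAN decoder would render these

@torch.no_grad()
def clip_similarity(image, text: str) -> float:
    """Cosine similarity between CLIP embeddings of an image and a text."""
    clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    out = clip(**proc(text=[text], images=image,
                      return_tensors="pt", padding=True))
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).item()

indices = generate_indices("misty mountains and pines in ink-wash style")
# image = vqgan.decode(indices.view(1, 16, 16))  # hypothetical VQGAN decoder
# score = clip_similarity(image, "misty mountains and pines in ink-wash style")
```

This sketch only illustrates the conditioning path from text features to discrete image tokens and the CLIP-based matching metric; the paper's actual architecture and training procedure may differ.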