Abstract: To address the problems that images generated by generative adversarial networks are structurally incomplete, unrealistic in content, and of poor quality, a text-to-image generation model that combines an attention mechanism with semantic segmentation maps (SSA-GAN) is proposed. First, taking the global sentence vector as the input condition, a simple and effective deep fusion module fully fuses the text information while the image is being generated. Second, edge-contour features are extracted from the semantically segmented images to provide additional generation and constraint conditions for the model, and an attention mechanism supplies fine-grained word-level information to enrich the details of the generated images. Finally, a multimodal similarity computation model computes a fine-grained image-text matching loss to further train the generator. The model is evaluated on the CUB-200 and Oxford-102 Flowers datasets, and the results show that the proposed SSA-GAN improves the quality of the generated images. Compared with models such as StackGAN, AttnGAN, DF-GAN, and RAT-GAN, the IS metric increases by 13.7% and 43.2%, and the FID metric is reduced by 34.7% and 74.9%, on the two datasets respectively.
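For illustration, the sketch below shows the kind of word-level attention the abstract refers to, in the style of AttnGAN's region-word attention; the function name, tensor shapes, and variable names are hypothetical assumptions for exposition, not the paper's actual implementation.

```python
# Minimal sketch (assumed, AttnGAN-style): each image sub-region attends over
# the word embeddings of the caption to obtain a word-level context vector.
import torch
import torch.nn.functional as F

def word_level_attention(word_feats: torch.Tensor, img_feats: torch.Tensor) -> torch.Tensor:
    """word_feats: (B, D, T) word embeddings; img_feats: (B, D, N) image
    sub-region features. Returns per-region context vectors of shape (B, D, N)."""
    # Similarity between every sub-region and every word: (B, N, T)
    attn = torch.bmm(img_feats.transpose(1, 2), word_feats)
    # Normalize over the word dimension so each region's weights sum to 1
    attn = F.softmax(attn, dim=2)
    # Attention-weighted sum of word features for each sub-region: (B, D, N)
    context = torch.bmm(word_feats, attn.transpose(1, 2))
    return context

# Example usage with illustrative sizes (batch 4, feature dim 256,
# 18 words, 17x17 = 289 image sub-regions):
words = torch.randn(4, 256, 18)
regions = torch.randn(4, 256, 289)
ctx = word_level_attention(words, regions)  # (4, 256, 289)
```

Context vectors of this form are what let the generator condition each spatial location on the most relevant words, enriching local detail as described above.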