With the rapid development of the digital media industry, demand for personalized, efficient, and controllable content generation is growing quickly. Generative Adversarial Networks (GANs) have become a key technology for meeting this demand because they learn data distributions well and can synthesize new content. However, current GANs still face several limitations in digital media generation, including mode collapse, limited controllability, and insufficient coordination across modalities. To address these challenges, this paper proposes Ctrl GAN, a GAN framework that integrates conditional control, multi-scale feature optimization, and latent-space semantic alignment. Specifically, this study modifies the mapping network of StyleGAN2 to improve the clarity and disentanglement of semantic representations in the latent space; introduces an attribute condition module for precise control over the visual attributes of generated content; and strengthens the multi-scale feature extraction of the discriminator to improve the realism of generated details. Experiments show that Ctrl GAN lowers the Fréchet Inception Distance (FID) to 18.2 on the COCO scene image generation task, a 12.6% reduction relative to the StyleGAN2 baseline. On the CelebA face dataset, the method achieves 91.5% attribute-control accuracy, an 8.3% improvement over ControlGAN. In the multimodal generation task, the model received a user satisfaction score of 4.6 out of 5, significantly higher than the other compared models.
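The reported gains imply baseline values that the abstract does not state directly. The following minimal sketch back-computes them from the figures above. It assumes that the 12.6% FID reduction is relative to the baseline, and it shows two possible readings of the 8.3% accuracy improvement (relative change versus percentage points), since the abstract does not specify which is meant. The function names are illustrative and do not come from the paper.

```python
# Back-compute the baselines implied by the reported figures.
# Assumption: the percentages are interpreted as stated in each function;
# the abstract itself does not say whether 8.3% is relative or absolute.

def baseline_from_relative_decrease(value: float, pct: float) -> float:
    """Return b such that value = b * (1 - pct / 100)."""
    return value / (1.0 - pct / 100.0)


def baseline_from_relative_increase(value: float, pct: float) -> float:
    """Return b such that value = b * (1 + pct / 100)."""
    return value / (1.0 + pct / 100.0)


def baseline_from_point_increase(value: float, points: float) -> float:
    """Return b such that value = b + points (percentage points)."""
    return value - points


# FID: 18.2 after a 12.6% relative reduction from the StyleGAN2 baseline.
fid_baseline = baseline_from_relative_decrease(18.2, 12.6)

# Attribute-control accuracy: 91.5% after an 8.3% improvement over ControlGAN.
acc_baseline_relative = baseline_from_relative_increase(91.5, 8.3)
acc_baseline_points = baseline_from_point_increase(91.5, 8.3)

print(f"Implied StyleGAN2 FID: {fid_baseline:.2f}")
print(f"Implied ControlGAN accuracy (relative reading): {acc_baseline_relative:.2f}%")
print(f"Implied ControlGAN accuracy (percentage-point reading): {acc_baseline_points:.2f}%")
```

Under the relative reading, the implied StyleGAN2 FID is roughly 20.8. The two readings of the accuracy gain give different ControlGAN baselines, so the interpretation matters when comparing against other reported results.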
Sun et al. (Thu,) studied this question.