Against the backdrop of the Rural Revitalization Strategy, Zhejiang Province is promoting “Zhejiang-style Vernacular Dwellings” as a key measure for improving the rural living environment and architectural appearance. However, traditional style-control tools, such as standardized rural housing design atlases, respond weakly to villagers’ individualized needs and demand a high level of professional expertise. Consequently, they cannot relieve the bottleneck in grassroots governance efficiency created by large volumes of personalized housing requests. Meanwhile, general-purpose generative AI, when applied to architectural design, often produces “structural hallucinations” and dilutes regional characteristics because it lacks physical tectonic constraints. Guided by the governance requirements of the Zhejiang Provincial Rural Housing Design Guidelines, this study proposes a compliance-evaluation-driven “Contour-Semantic-Image” hierarchical generative control framework, with the aim of building a visual scheme generation and pre-screening workflow closely aligned with the logic of rural governance. At the data level, the study aggregates multi-source materials, including official standardized atlases, government style guidelines, and real-world photographs. Through expert screening and standardized processing of 596 schemes, it constructs a dataset of 333 high-quality, finely annotated structured samples. A human-guided, machine-segmented workflow assisted by Segment Anything Model 2 (SAM 2) is then used to establish a semantic label system comprising 4 major categories and 13 subcategories of components, thereby decomposing architectural prior knowledge into explicit structural elements.
At the generation level, a two-stage model is trained on Stable Diffusion and ControlNet: Stage I takes contour conditions and “layout prompts” as input to generate semantic label maps, strengthening component topology and layout consistency; Stage II takes the semantic label maps and “style prompts” as conditions to generate photorealistic facade images. By using explicit semantic constraints to move the model from pixel synthesis toward logical generation, the framework achieves controllable rendering of stylistic details and material expression. At the evaluation level, an automated verification system of “clause translation–metric calculation–comprehensive scoring” is proposed. It scores, re-ranks, and provides diagnostic feedback on the generated variants along three dimensions: Design Rationality (Q), General Compliance (G), and Jiangnan water-town Regional Characteristics (P-J), forming a closed-loop “Generation-Evaluation-Feedback” workflow. Overall, the framework provides a “visualizable, evaluable, and explainable” pathway for scheme generation and pre-screening in the digital governance of rural architectural appearance.
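The scoring, re-ranking, and diagnostic feedback step can be illustrated with a minimal Python sketch. It is not the study's implementation: the weights, the pass threshold, the `Variant` type, and the function names are hypothetical, and each dimension score is assumed to be already normalized to [0, 1] by the metric-calculation stage.

```python
from dataclasses import dataclass

# Assumed weights for the three dimensions; the source does not state them.
WEIGHTS = {"Q": 0.4, "G": 0.3, "P-J": 0.3}
# Assumed per-dimension pass mark used only for diagnostic feedback.
THRESHOLD = 0.6


@dataclass(frozen=True)
class Variant:
    """One generated facade variant with per-dimension scores in [0, 1]."""
    name: str
    scores: dict


def composite(variant: Variant) -> float:
    """Weighted sum of the Q, G, and P-J scores (comprehensive score)."""
    return sum(WEIGHTS[dim] * variant.scores[dim] for dim in WEIGHTS)


def diagnose(variant: Variant) -> list:
    """Return the dimensions whose score falls below the assumed threshold."""
    return [dim for dim in WEIGHTS if variant.scores[dim] < THRESHOLD]


def rerank(variants: list) -> list:
    """Order variants from highest to lowest comprehensive score."""
    return sorted(variants, key=composite, reverse=True)
```

Under these assumptions, `rerank` yields the pre-screened ordering, while `diagnose` names the weak dimensions that could be fed back to the generation stage to close the loop.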
Wu et al. studied this question.