Although diffusion-based image generation models enable high-quality synthesis of fashion images, the reliable control of perceptual attributes in these models remains poorly understood. Current evaluation approaches primarily rely on semantic similarity metrics, such as CLIP scores, which may not accurately reflect human perceptual judgments. This study proposes a three-layer evaluation framework linking latent space geometry, semantic embedding space, and human perception. First, latent attribute directions are validated using geometric quality-control metrics measuring linearity and centrality. Second, semantic consistency is examined through directional projection in CLIP embedding space. Third, a two-alternative forced-choice experiment is conducted with 37 participants, and perceptual strength is estimated using a Bradley-Terry preference model. Experiments cover gender and garment conditions for four fashion attributes: fit, lightness, glossiness, and pattern scale. Results reveal that fit exhibits strong cross-layer alignment, while pattern scale shows semantic and perceptual ambiguity. The findings highlight that perceptual reliability in controllable generation is attribute-dependent and that semantic metrics alone cannot fully replace human evaluation.
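As an illustration of the third layer, the following is a minimal sketch of how Bradley-Terry strengths can be estimated from two-alternative forced-choice counts. It assumes the standard minorization-maximization update and is not necessarily the fitting procedure used in the study. The function name `fit_bradley_terry` and the win counts are hypothetical and chosen only for illustration.

```python
def fit_bradley_terry(wins, n_iter=500, tol=1e-12):
    """Estimate Bradley-Terry strengths from pairwise win counts.

    wins[i][j] is the number of trials in which item i was chosen over item j.
    Returns strengths normalized to sum to 1.
    Assumption: standard MM update; not taken from the paper itself.
    """
    n = len(wins)
    p = [1.0 / n] * n  # start from equal strengths
    for _ in range(n_iter):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            # Keep the previous value if item i was never compared.
            new_p.append(total_wins / denom if denom > 0 else p[i])
        norm = sum(new_p)
        new_p = [x / norm for x in new_p]
        converged = max(abs(a - b) for a, b in zip(new_p, p)) < tol
        p = new_p
        if converged:
            break
    return p


# Hypothetical counts for three edit strengths of one attribute (not study data).
example_wins = [
    [0, 3, 2],
    [7, 0, 4],
    [8, 6, 0],
]
strengths = fit_bradley_terry(example_wins)
print(strengths)
```

Under the model, the probability that item i is preferred over item j is p_i / (p_i + p_j), so the fitted strengths place the compared images on a common perceptual scale.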
Noriaki Kuwahara
Shintaro Kawanami
Takashi Sato
International Journal of Advanced Computer Science and Applications
Kuwahara et al. studied this question.
www.synapsesocial.com/papers/69fbefef164b5133a91a408b — DOI: https://doi.org/10.14569/ijacsa.2026.0170424