We present a systematic evaluation of color recognition accuracy across four Vision-Language Models (GPT-4o, Claude 3.5 Sonnet, Claude Sonnet 4, and LLaVA 7B) using 40 colors drawn from the HSL color space, for 480 total observations, with error measured by the CIEDE2000 color difference (ΔE00). The commercial models achieve a mean ΔE00 of 2.51 to 3.33, while LLaVA 7B shows a substantially higher error (ΔE00 = 24.63). All models perform better on primary colors than on intermediate hues. In addition, 95.4% of AI-generated UI pixels fall in the blue-purple range, connecting VLM color biases to the "AI Slop" phenomenon.
Ken Imoto (Sun,) studied this question.
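For readers unfamiliar with the metric, the following is a minimal sketch of the standard CIEDE2000 color difference between two CIELAB colors. It is not the authors' evaluation code. The function name `ciede2000`, the use of Lab tuples as input, and the unit weighting factors (kL = kC = kH = 1) are assumptions made for illustration, and the conversion from HSL or RGB to Lab is left out because the abstract does not specify it.

```python
import math


def ciede2000(lab1, lab2):
    """CIEDE2000 difference between two (L*, a*, b*) tuples, with kL = kC = kH = 1 (assumed)."""
    L1, a1, b1 = lab1
    L2, a2, b2 = lab2

    # Step 1: rescale a* to a' and compute adjusted chroma C' and hue h'.
    c_bar = (math.hypot(a1, b1) + math.hypot(a2, b2)) / 2
    g = 0.5 * (1 - math.sqrt(c_bar**7 / (c_bar**7 + 25**7)))
    a1p, a2p = (1 + g) * a1, (1 + g) * a2
    c1p, c2p = math.hypot(a1p, b1), math.hypot(a2p, b2)
    h1p = math.degrees(math.atan2(b1, a1p)) % 360
    h2p = math.degrees(math.atan2(b2, a2p)) % 360

    # Step 2: differences in lightness, chroma, and hue.
    d_lp = L2 - L1
    d_cp = c2p - c1p
    if c1p * c2p == 0:
        d_hp_deg = 0.0
    else:
        d_hp_deg = h2p - h1p
        if d_hp_deg > 180:
            d_hp_deg -= 360
        elif d_hp_deg < -180:
            d_hp_deg += 360
    d_hp = 2 * math.sqrt(c1p * c2p) * math.sin(math.radians(d_hp_deg / 2))

    # Step 3: means, weighting functions, and the rotation term.
    l_bar = (L1 + L2) / 2
    c_bar_p = (c1p + c2p) / 2
    if c1p * c2p == 0:
        h_bar = h1p + h2p
    elif abs(h1p - h2p) <= 180:
        h_bar = (h1p + h2p) / 2
    elif h1p + h2p < 360:
        h_bar = (h1p + h2p + 360) / 2
    else:
        h_bar = (h1p + h2p - 360) / 2

    t = (1
         - 0.17 * math.cos(math.radians(h_bar - 30))
         + 0.24 * math.cos(math.radians(2 * h_bar))
         + 0.32 * math.cos(math.radians(3 * h_bar + 6))
         - 0.20 * math.cos(math.radians(4 * h_bar - 63)))
    d_theta = 30 * math.exp(-(((h_bar - 275) / 25) ** 2))
    r_c = 2 * math.sqrt(c_bar_p**7 / (c_bar_p**7 + 25**7))
    s_l = 1 + 0.015 * (l_bar - 50) ** 2 / math.sqrt(20 + (l_bar - 50) ** 2)
    s_c = 1 + 0.045 * c_bar_p
    s_h = 1 + 0.015 * c_bar_p * t
    r_t = -math.sin(math.radians(2 * d_theta)) * r_c

    return math.sqrt((d_lp / s_l) ** 2
                     + (d_cp / s_c) ** 2
                     + (d_hp / s_h) ** 2
                     + r_t * (d_cp / s_c) * (d_hp / s_h))


# Two nearby blues from a commonly used published CIEDE2000 test set.
print(round(ciede2000((50.0, 2.6772, -79.7751), (50.0, 0.0, -82.7485)), 4))
```

A ΔE00 near 1 is commonly described as close to the threshold of a just-noticeable difference, which gives a rough sense of scale for the mean errors reported above.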