Despite their impressive performance, modern foundation models for image recognition still struggle with complex reasoning that involves multiple sources of information, and they can exhibit unexpected behaviours such as spurious correlations or inaccurate predictions. Moreover, their decision processes remain opaque, which limits trust and interpretability. To address these challenges, this thesis introduces dialogue-based image recognition, a novel paradigm for performing and understanding visual recognition through structured interactions between agents. Three frameworks are developed. The first, inspired by argumentative dialogue, enables two agents to deliberate transparently over image classification using global prototype similarities and local visual attributes derived from foundation models. The second extends the dialogue-based approach to complex scene understanding, where agents combine visual information with a knowledge base to verify image descriptions. The third explores human-machine dialogue for model editing, allowing users to identify and correct unexpected behaviours. Together, these contributions constitute the first exploration of dialogue as a foundation for explainable visual recognition, showing how it can address the key challenges of opacity, complex reasoning, and unexpected behaviours in image recognition.
Dao Thauvin
Dao Thauvin (Tue,) studied this question.