Do Image-Text Metrics Respect Semantic Invariances?
Abstract. This paper studies whether reference-free image-to-text evaluators respect meaning-preserving changes to images and captions. It probes popular caption-alignment metrics under spatial, object-level, and socio-linguistic perturbations, showing that non-semantic changes can shift scores and alter system rankings. It also proposes invariance-calibrated scoring, which reduces these sensitivities while preserving alignment with learned caption evaluators.