LEAF Enables Label-Efficient Image Quality Assessment with Minimal MOS Annotations

Researchers are tackling the challenge of efficiently assessing image quality using artificial intelligence. Xinyue Li and Zhichao Zhang from Shanghai Jiao Tong University, together with Zhiming Xu and colleagues, demonstrate a novel framework, LEAF, that significantly reduces the reliance on extensive human labelling for image quality assessment (IQA). Their work highlights that the primary difficulty in utilising powerful multimodal large language models (MLLMs) for IQA isn’t perceiving quality, but accurately calibrating the models to human opinion scales. By distilling perceptual knowledge from a large ‘teacher’ MLLM into a smaller, more manageable ‘student’ regressor, LEAF achieves strong performance with minimal human-provided Mean Opinion Scores (MOS), paving the way for practical and cost-effective IQA systems.


MLLM Calibration for Label-Efficient Image Quality Assessment

Specifically, the study details a two-stage process where the teacher MLLM provides dense supervision through both point-wise image judgments and pair-wise preference comparisons, alongside an assessment of its own decision reliability. This rich supervisory signal guides the student model in learning the teacher’s quality perception patterns via joint distillation, followed by calibration on a small subset of images with available MOS data to ensure alignment with human annotations. Experiments conducted on both user-generated and AI-generated IQA benchmarks demonstrate that LEAF substantially lowers the need for human labelling while maintaining strong correlations with MOS, making practical lightweight IQA feasible even with limited annotation resources. The research establishes that MLLMs excel at capturing comparative quality assessments, distinguishing between better and worse images or assigning broad quality levels, but struggle to map these perceptions onto specific MOS rating scales for particular datasets and scenarios.
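To make the distillation stage concrete, the sketch below shows one way a joint objective of this shape could be implemented in PyTorch: a point-wise regression term on the teacher’s scores plus a pair-wise ranking term weighted by the teacher’s self-assessed reliability. The function name, tensor layout, and the Bradley-Terry-style pair-wise term are illustrative assumptions, not the paper’s exact loss.

```python
import torch
import torch.nn.functional as F

def joint_distillation_loss(student_scores, teacher_scores,
                            pair_idx, pair_prefs, reliability,
                            w_point=1.0, w_pair=1.0):
    """Hypothetical reliability-weighted distillation objective.

    student_scores : (N,) student predictions for a batch of images
    teacher_scores : (N,) point-wise quality judgments from the teacher MLLM
    pair_idx       : (P, 2) long tensor of compared image index pairs
    pair_prefs     : (P,) 1.0 if the teacher prefers the first image, else 0.0
    reliability    : (P,) teacher's self-assessed confidence per comparison
    """
    # Point-wise term: regress onto the teacher's absolute judgments.
    point_loss = F.mse_loss(student_scores, teacher_scores)

    # Pair-wise term: a ranking loss on score differences,
    # down-weighted where the teacher reports low reliability.
    diff = student_scores[pair_idx[:, 0]] - student_scores[pair_idx[:, 1]]
    pair_loss = F.binary_cross_entropy_with_logits(diff, pair_prefs,
                                                   weight=reliability)

    return w_point * point_loss + w_pair * pair_loss
```

The appeal of this shape of objective is that both signal types come from the teacher alone, so the entire distillation stage runs without any human MOS labels.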

Figure 2 illustrates this on the AGIQA-3K dataset, showing that direct MLLM scoring maintains reliable ranking but exhibits MOS scale miscalibration, which is substantially improved by calibrating a lightweight head with just 10% of the MOS data. This finding underscores the importance of focusing on MOS scale calibration rather than completely relearning perceptual abilities. This breakthrough reveals a pathway to efficient IQA by decoupling perception from calibration, offering a significant advantage over traditional methods that demand extensive MOS annotation for training. The work opens possibilities for real-world applications requiring fast, low-cost, and scalable quality prediction, such as on-device assessment, large-scale data filtering, and real-time monitoring, all while minimising the burden of human labelling. The team anticipates releasing the code for LEAF upon publication, further enabling the broader research community to benefit from this innovation.
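For intuition, here is a minimal stand-in for that lightweight calibration step: fitting a scale and shift on a small labelled subset by least squares. The function name and the affine form are assumptions for illustration; the paper’s released code may differ.

```python
import numpy as np

def calibrate_linear_head(scores, mos, labeled_frac=0.10, seed=0):
    """Illustrative calibration: fit mos ~ a * score + b on a small
    labelled subset, then apply the map to all predictions."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(mos), size=max(2, int(labeled_frac * len(mos))),
                     replace=False)

    # Least-squares fit on the labelled subset only.
    A = np.stack([scores[idx], np.ones(len(idx))], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, mos[idx], rcond=None)

    # A monotone affine map (a > 0) preserves the ranking, and hence SRCC;
    # it only repairs the MOS scale, which is what lifts PLCC.
    return a * scores + b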

MLLM Distillation for Label-Efficient Image Quality Assessment

This innovative technique significantly reduces reliance on extensive MOS annotations while maintaining strong performance correlations. The core of the work involved dense supervision from the MLLM teacher through both point-wise judgments and pair-wise preferences, alongside an assessment of decision reliability. Researchers engineered this system to generate detailed signals guiding the student regressor in learning the teacher’s quality perception patterns. This learning process employed joint distillation, effectively transferring knowledge from the complex MLLM to the streamlined student network.
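The summary does not specify the student’s exact architecture, but a minimal PyTorch sketch of such a streamlined student regressor, assuming a small torchvision backbone feeding a single linear quality head, could look like this:

```python
import torch.nn as nn
import torchvision.models as models

class StudentRegressor(nn.Module):
    """Hypothetical lightweight student: a small CNN backbone plus a
    single linear head that outputs a scalar quality score."""

    def __init__(self):
        super().__init__()
        backbone = models.resnet18(weights=None)  # stand-in backbone choice
        backbone.fc = nn.Identity()               # expose 512-d features
        self.backbone = backbone
        self.head = nn.Linear(512, 1)             # scalar quality prediction

    def forward(self, x):                         # x: (B, 3, 224, 224)
        return self.head(self.backbone(x)).squeeze(-1)
```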

Subsequently, the student was calibrated using a small subset of images with corresponding MOS scores, aligning its predictions with human annotations. Experiments were conducted on the KonIQ-10k dataset, comprising 10,073 images and 1,200,000 ratings, and an AI-generated IQA benchmark. To quantify performance, the team measured MOS-aligned correlations, specifically utilising Spearman’s Rank Correlation Coefficient (SRCC) and Pearson Linear Correlation Coefficient (PLCC). The LEAF framework, calibrating a linear head with only 10% of the MOS data, substantially improved the PLCC to 0.907 and reduced the mean residual bias from 0.263 to 0.006. This calibration process involved normalising both MOS and predicted scores to a range of 0 to 1, enabling a direct comparison of residual distributions. The approach enables practical lightweight IQA under limited annotation budgets.
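These evaluation quantities are standard, and a short sketch shows how SRCC, PLCC, and the normalised mean residual bias described above could be computed with SciPy and NumPy (the function name is illustrative):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def iqa_metrics(pred, mos):
    """SRCC, PLCC, and the mean residual bias after min-max
    normalisation of both score sets to [0, 1]."""
    srcc, _ = spearmanr(pred, mos)
    plcc, _ = pearsonr(pred, mos)

    norm = lambda x: (x - x.min()) / (x.max() - x.min())
    bias = float(np.mean(norm(np.asarray(pred)) - norm(np.asarray(mos))))
    return srcc, plcc, bias
```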

LEAF framework excels with limited image labels

In the experimental setup, images are resized to 256 pixels and cropped to 224 × 224 pixels for processing, with random cropping and horizontal flipping applied during training and centre cropping used for evaluation. The team measured performance on both user-generated content (UGC) and AI-generated content (AIGC) benchmarks, achieving strong results with limited labelled data. Specifically, on the AGIQA-3K benchmark, the LEAF framework attained an SRCC of 0.749 and a PLCC of 0.811 in a label-free setting, surpassing the strongest label-free baseline’s SRCC of 0.684. Results demonstrate that utilising 30% of the MOS labels further enhances performance, reaching an SRCC of 0.868 and a PLCC of 0.914 on AGIQA-3K, positioning the method competitively against state-of-the-art weakly supervised approaches that typically require around 70% of MOS labels for calibration.
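A minimal torchvision sketch of this preprocessing, assuming random cropping as the train-time spatial augmentation (the exact augmentation parameters are not given in the summary):

```python
from torchvision import transforms

# Train-time pipeline: resize, random 224 x 224 crop, horizontal flip.
train_tf = transforms.Compose([
    transforms.Resize(256),            # shorter side resized to 256 px
    transforms.RandomCrop(224),        # random spatial crop for augmentation
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# Evaluation pipeline: deterministic centre crop, no augmentation.
eval_tf = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
```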

On the AIGIQA-20K benchmark, the framework achieved an SRCC of 0.696 and a PLCC of 0.762 in the label-free setting, establishing the best SRCC among label-free competitors and exceeding the best label-free PLCC of 0.697. These measurements confirm the effectiveness of teacher-driven dense supervision in transferring reliable quality perception even without extensive human labelling. On UGC benchmarks the framework also performs strongly: on KonIQ-10k, the method achieved an SRCC of 0.777 and a PLCC of 0.801, while on SPAQ it reached an SRCC of 0.861 and a PLCC of 0.867. An ablation further showed that results are largely insensitive to the number of samples used, highlighting the robustness of the approach. The research marks a significant step towards more efficient and accurate image quality assessment, reducing the need for large-scale human annotation while maintaining high performance.

LEAF calibrates image quality via distillation, improving perceptual alignment

The research addresses the challenge of calibrating models to accurately reflect human perception of image quality, arguing that the primary difficulty isn’t perceptual understanding itself, but aligning model outputs with specific Mean Opinion Score (MOS) scales. This label-efficient approach involves the MLLM teacher providing point-wise judgements and pair-wise preferences, guiding the student model’s learning of quality perception patterns during a distillation phase, without initial dependence on human annotations. A final, lightweight calibration stage then aligns predictions with human annotations using a small MOS subset. Experiments on both user-generated and AI-generated images demonstrate that LEAF significantly reduces annotation requirements while achieving competitive or superior correlation with human judgements, offering a practical alternative to traditional, heavily annotated regression methods.

The authors acknowledge that performance gains plateaued with larger student backbone architectures, suggesting a moderately sized model offers the best trade-off between capacity and optimisation stability. Future research could explore extending this framework to other label-efficient quality assessment tasks, potentially broadening its applicability beyond image quality assessment. This work represents a significant step towards more practical and accessible IQA systems, reducing the burden of annotation and enabling wider deployment of quality assessment tools.

👉 More information
🗞 Decoupling Perception and Calibration: Label-Efficient Image Quality Assessment Framework
🧠 ArXiv: https://arxiv.org/abs/2601.20689

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.

Latest Posts by Rohail T.:

Turbulence Modelling Reveals Interference in Quantum Free-Space Optical Links (April 11, 2026)

Quantum States’ Geometry, Not Size, Now Fully Defines Their Difference (April 11, 2026)

Quantum States Remain Stable Despite Optical Loss Using Novel Technique (April 11, 2026)