IEEE Transactions on Broadcasting, vol. 69, no. 2, pp. 378-395, June 2023
Abstract: This paper calculates conﬁdence intervals for 89 datasets that use the 5-level Absolute Category Rating (ACR) method to evaluate the quality of speech, video, images, and video with audio. This data allows us to compute the subjective test conﬁdence interval (ΔS_CI) for 5-level ACR tests. We use a confusion matrix to compare conclusions reached by 88 lab-to-lab comparisons, 22 method-to-method comparisons, and 12 comparisons between expert and naïve subjects. We estimate the differences in conclusions reached by ad hoc evaluations, compared to subjective tests. We recommend using the disagree incidence rate to identify lab-to-lab differences (i.e., the likelihood that signiﬁcantly different stimulus pairs receive opposing rank order from the two labs). Disagree incidence rates above 0.31% are unusual enough to warrant investigation and disagree incidence rates above 1.0%indicate differences in method, test environment, test implementation, or subject demographics. These incidence rates form the basis for a new statistical method that calculates the conﬁdence interval of a metric (ΔM_CI). When ΔM_CI is used to make decisions, the equivalence to a video-quality test (EVQT) method determines whether a metric acts similarly to a subjective test. When ΔM_CI is not used, the metric is likened to a certain number of people in a video-quality test (PVQT). This information will help users make the better decisions when applying quality metrics. The algorithm code is made available for any purpose. Most of the ratings used in this paper come from open datasets.
Keywords: statistics; video quality; precision; mean opinion score; MOS; image quality; audiovisual quality; metric; subjective test; confidence interval; confusion matrix; false ranking
For technical information concerning this report, contact:
Margaret H. Pinson
Institute for Telecommunication Sciences
Disclaimer: Certain commercial equipment, components, and software may be identified in this report to specify adequately the technical aspects of the reported results. In no case does such identification imply recommendation or endorsement by the National Telecommunications and Information Administration, nor does it imply that the equipment or software identified is necessarily the best available for the particular application or uses.