Interrater Agreement Issues
Studies suggest good to excellent interrater reliability for semi-structured interviews (Zimmerman, 1994; Rogers, 2003; Segal and Coolidge, 2003), but in most cases this is limited to two raters coding responses to the same interview (Widiger and Samuel, 2005). Test-retest reliability is lower, especially when the interval between interviews exceeds a few weeks or coincides with a change in mental state (Zimmerman, 1994). The implication is that patients change, and that assessments are reliable only for a limited period. The gap between joint-interview interrater reliability, short-interval test-retest reliability, and long-interval test-retest reliability varies across personality disorders, but antisocial personality disorder is less affected than others.

Interrater reliability refers to the extent to which two or more observers agree. Suppose two people are sent to a clinic to observe waiting times, the appearance of the waiting and examination rooms, and the general atmosphere. If the observers agreed perfectly on all points, interrater reliability would be perfect. Interrater reliability is improved by training data collectors, giving them guidance on how to record their observations, monitoring the quality of data collection over time to guard against observer drift and fatigue, and meeting to discuss difficult cases. Intrarater reliability refers to the consistency of measurement by a single person; it too can be improved through training, monitoring, and ongoing practice. If the number of rating categories is small (e.g., 2 or 3), the probability that two raters agree purely by chance increases considerably.
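This chance-inflation effect can be sketched numerically. The expected chance agreement between two independent raters is the sum, over categories, of the product of the raters' marginal proportions; the proportions below are hypothetical:

```python
# Probability that two independent raters agree purely by chance,
# given each rater's marginal category proportions (hypothetical values).

def chance_agreement(p_rater_a, p_rater_b):
    """Expected chance agreement: sum over categories of the product
    of the two raters' marginal proportions."""
    return sum(a * b for a, b in zip(p_rater_a, p_rater_b))

# Two categories, each used about half the time by both raters:
print(round(chance_agreement([0.5, 0.5], [0.5, 0.5]), 4))  # 0.5 -> 50% by chance

# Five equally used categories:
print(round(chance_agreement([0.2] * 5, [0.2] * 5), 4))    # 0.2 -> only 20% by chance
```

With only two categories, half of all ratings would match even if the raters ignored the cases entirely, which is exactly why raw agreement overstates reliability on coarse scales.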
Indeed, both raters are restricted to the limited number of options available, which inflates the overall rate of agreement without necessarily reflecting their propensity for "intrinsic" agreement (agreement beyond what chance alone would produce). In our case, Rater A had kappa = 0.506 and Rater B had kappa = 0.585 in intra-rater tests, while in inter-rater tests kappa was 0.580 for the first measure and 0.535 for the second. Such kappa values indicate moderate intra- and inter-rater agreement, somewhat above the midpoint between kappa = 0 (agreement attributable to chance alone) and kappa = 1 (perfect agreement).

Agreement between observers (i.e., interrater agreement) can be quantified with different criteria, but selecting the appropriate one is crucial. If the measure is qualitative (nominal or ordinal), the proportion of agreement or the kappa coefficient should be used to assess consistency between raters (i.e., interrater reliability). The kappa coefficient is more informative than the raw percentage of agreement, since the latter does not account for agreement arising from chance alone. If the measures are quantitative, the intraclass correlation coefficient (ICC) should be used to assess agreement, but with caution: there are several different ICCs, so it is important to describe the model and type of ICC used. The Bland-Altman method can be used to assess consistency and agreement, but its application should be limited to comparing two raters.
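The relationship between kappa and raw percentage agreement can be made concrete with a small sketch. The contingency table below is hypothetical; kappa is computed with the standard formula κ = (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is chance agreement from the marginals:

```python
# Cohen's kappa versus raw percent agreement for two raters,
# computed from a contingency table of their ratings (hypothetical counts).

def kappa_and_percent(table):
    """Return (kappa, observed proportion of agreement) for a square
    contingency table: rows = rater A's categories, columns = rater B's."""
    k = len(table)
    n = sum(sum(row) for row in table)
    p_o = sum(table[i][i] for i in range(k)) / n            # observed agreement
    rows = [sum(row) / n for row in table]                  # rater A marginals
    cols = [sum(table[i][j] for i in range(k)) / n for j in range(k)]  # rater B marginals
    p_e = sum(r * c for r, c in zip(rows, cols))            # chance agreement
    return (p_o - p_e) / (1 - p_e), p_o

# Hypothetical ratings of 100 cases into two categories:
table = [[40, 5],
         [10, 45]]
kappa, percent = kappa_and_percent(table)
print(f"kappa = {kappa:.3f}, percent agreement = {percent:.1%}")
```

Here 85% raw agreement corresponds to κ = 0.70, because half of the agreement would be expected by chance from these balanced marginals.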
Bronson and Bundy (2001) investigated rater reliability and the estimated error of the item model. Goodness-of-fit statistics showed that the data from 100% of the raters (n = 10) met the expectations of the Rasch model. In addition, the estimated item errors were low (<0.25) for all items but one ("young playmates read players' notes"; error = 0.26).

Kappa is similar to a correlation coefficient in that it cannot go above +1.0 or below -1.0.
Because it is used as a measure of agreement, only positive values are expected in most situations; negative values would indicate systematic disagreement. Kappa can reach very high values only when agreement is good and the rate of the target condition is near 50% (because the base rate enters the calculation of the joint probabilities). Several authorities have proposed "rules of thumb" for interpreting the degree of agreement, and most of them essentially coincide, even though the wording differs.

A good example of the concern about interpreting obtained kappa values appears in an article that compared visual detection of abnormalities in biological samples by humans with automated detection (12). The results showed only moderate human-machine agreement in terms of kappa (κ = 0.555), yet the same data gave an excellent percentage agreement of 94.2%. The problem in interpreting these two statistics is: how are researchers to decide whether the raters are reliable? Do the results indicate that the vast majority of patients receive accurate laboratory results, and therefore correct or incorrect medical diagnoses? In the same study, the researchers designated one data collector as the standard and compared the results of five other technicians against that standard. Although the article does not report enough data to calculate percentage agreement, the kappa results were only moderate.
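The divergence between an excellent percentage agreement and a merely moderate kappa is driven by base-rate imbalance, and it is easy to reproduce. The counts below are hypothetical, patterned after the kind of imbalance described above (a rare target condition), not the actual data from the cited study:

```python
# Illustration (hypothetical counts): with a rare target condition, raw
# percent agreement can look excellent while kappa stays only moderate,
# because most of the observed agreement is expected by chance alone.

def kappa_and_percent(table):
    """(kappa, observed agreement) from a square contingency table."""
    k = len(table)
    n = sum(sum(row) for row in table)
    p_o = sum(table[i][i] for i in range(k)) / n
    rows = [sum(row) / n for row in table]
    cols = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    p_e = sum(r * c for r, c in zip(rows, cols))
    return (p_o - p_e) / (1 - p_e), p_o

# 1000 samples, only ~6% rated abnormal; raters agree on 95% of cases:
table = [[910, 25],   # both "normal"      | A normal, B abnormal
         [25, 40]]    # A abnormal, B normal | both "abnormal"
kappa, percent = kappa_and_percent(table)
print(f"percent agreement = {percent:.1%}, kappa = {kappa:.3f}")
```

Because nearly every case falls in the "normal" cell, chance agreement is already about 88%, so 95% raw agreement earns a kappa of only about 0.59, the same pattern as the 94.2% versus 0.555 result discussed above.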
How is a laboratory manager to know whether the results represent high-quality readings with little disagreement among trained technicians, or whether there is a serious problem requiring additional training? Unfortunately, kappa statistics alone do not provide enough information to make such a decision. Moreover, a kappa can have such a wide confidence interval (CI) that it spans everything from good to poor agreement.

As Marusteri and Bacarea (9) noted, there is never 100% certainty about research results, even when statistical significance is reached. Statistical tests of hypotheses about the relationship between independent and dependent variables become meaningless if the raters are inconsistent in scoring those variables. If agreement is less than 80%, more than 20% of the data being analyzed are erroneous.
With a reliability of only 0.50 to 0.60, one must accept that 40% to 50% of the data being analyzed are erroneous. When kappa values fall below 0.60, the confidence intervals around the obtained kappa are so wide that one must assume that about half of the data could be incorrect (10). Clearly, statistical significance does not mean much when there is this much error in the results being tested.

Many situations in healthcare rely on multiple people to collect research or clinical laboratory data. Because of variability among human observers, the question of consistency, or agreement, among those collecting the data arises immediately. Well-designed research studies should therefore include procedures that measure agreement among the various data collectors. Study designs usually involve training the data collectors and measuring the extent to which they record the same values for the same phenomena.
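One practical way to see how wide a kappa confidence interval can be, as discussed above, is to bootstrap it from the paired ratings. The ratings below are hypothetical, and the percentile bootstrap is one common choice among several interval methods:

```python
# Sketch: a percentile-bootstrap 95% CI for Cohen's kappa from paired
# ratings (hypothetical data), showing how wide the interval can be
# for a modest sample size.
import random

def cohen_kappa(pairs):
    """Cohen's kappa from a list of (rater_a, rater_b) category pairs."""
    cats = sorted({c for pair in pairs for c in pair})
    n = len(pairs)
    p_o = sum(a == b for a, b in pairs) / n
    p_e = sum((sum(a == c for a, _ in pairs) / n) *
              (sum(b == c for _, b in pairs) / n) for c in cats)
    return (p_o - p_e) / (1 - p_e)

random.seed(0)
# 60 hypothetical paired ratings, mostly agreeing:
pairs = ([("pos", "pos")] * 20 + [("neg", "neg")] * 28 +
         [("pos", "neg")] * 7 + [("neg", "pos")] * 5)

# Resample the pairs with replacement and recompute kappa each time:
boots = sorted(cohen_kappa(random.choices(pairs, k=len(pairs)))
               for _ in range(2000))
lo, hi = boots[round(0.025 * 2000)], boots[round(0.975 * 2000)]
print(f"kappa = {cohen_kappa(pairs):.2f}, 95% bootstrap CI = [{lo:.2f}, {hi:.2f}]")
```

Even with 60 cases and a point estimate near 0.59, the interval typically stretches across two or three of the usual "rule of thumb" bands, which is precisely why a bare kappa value is hard to act on.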