Patient-Reported Outcomes: Development and Validation
Demissie Alemayehu, Joseph C. Cappelleri, Birol Emir, Kelly H. Zou in Statistical Topics in Health Economics and Outcomes Research, 2017
For categorical data, the kappa statistic is a measure of agreement between two nominal variables having two or more categories (Cohen, 1960). The kappa coefficient explicitly adjusts for agreement that occurs by chance alone, and thus can be defined as chance-corrected agreement. For example, the data in Table 2.2 on the diagnosis of ED come from two methods: a gold standard (clinical) diagnosis and a PRO-based diagnosis (erectile function domain of the IIEF). The overall agreement observed is simply (1000 + 102) / 1151 = 0.96 (or 96%). After correction for the expectation that a certain number of agreements would arise by chance alone, the kappa coefficient (or the proportion of chance-corrected agreement) between the erectile function domain of the IIEF and the clinical diagnosis was 0.78, which is substantial given that the maximum value of kappa is 1.
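A minimal sketch of this calculation in Python: the diagonal counts (1000 and 102) and the total (1151) come from the text above, but the off-diagonal split is not given, so the 25/24 split below is assumed for illustration (it reproduces the reported kappa of approximately 0.78).

```python
# Chance-corrected agreement (Cohen's kappa) from a 2x2 table.
# Diagonal cells and total n = 1151 are from the text; the
# off-diagonal split (25 / 24) is assumed for illustration.
table = [[1000, 25],   # row: clinical ED,    columns: PRO ED / no ED
         [24,   102]]  # row: clinical no ED

n = sum(sum(row) for row in table)
p_observed = (table[0][0] + table[1][1]) / n  # raw agreement on the diagonal

# Agreement expected by chance alone, from the marginal totals.
row_totals = [sum(row) for row in table]
col_totals = [table[0][j] + table[1][j] for j in range(2)]
p_expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n**2

kappa = (p_observed - p_expected) / (1 - p_expected)
print(f"observed agreement = {p_observed:.2f}")  # 0.96
print(f"kappa              = {kappa:.2f}")       # 0.78
```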
How Will the Data be Analyzed?
Trena M. Paulus, Alyssa Friend Wise in Looking for Insight, Transformation, and Learning in Online Talk, 2019
Cohen’s kappa (κ; Cohen, 1960) is relatively straightforward to calculate and produces an accurate estimate of reliability when rating data are complete and rater error is evenly distributed. While the original kappa statistic is calculated for two raters applying a single coding scheme, adapted versions that allow for additional raters and categories have been developed. Kappa values can range from −1 to 1, with higher values indicating higher inter-rater reliability and values at or below 0 indicating agreement no better than chance. There is no generally accepted threshold for an acceptable level of kappa, though benchmarks of 0.61–0.80 as “good” and 0.81–1.00 as “very good” have been suggested (Landis & Koch, 1977; Altman, 1991). Alternatively, Fleiss, Levin, and Paik (2003) describe kappa of 0.40–0.75 as “intermediate to good” and greater than 0.75 as “excellent.” A good rule of thumb is that κ > 0.70 is reasonable for supporting inferences about the content of online talk, results based on 0.60 < κ < 0.70 should be interpreted with caution, and content analysis yielding κ < 0.60 should not be interpreted.
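A small helper along the lines of the rule of thumb above makes the interpretation explicit (the thresholds are those stated in the text; the function name is ours):

```python
def interpret_kappa(kappa: float) -> str:
    """Rule of thumb above for content analysis of online talk."""
    if kappa > 0.70:
        return "reasonable: supports inferences about the content"
    if kappa > 0.60:
        return "marginal: interpret results with caution"
    return "insufficient: do not interpret the content analysis"

for k in (0.85, 0.65, 0.45):
    print(f"kappa = {k:.2f} -> {interpret_kappa(k)}")
```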
Measuring the impact of parental illness
David Morley, Xiaoming Li, Crispin Jenkinson in Children and Young People's Response to Parental Illness, 2016
A third aspect of reliability that may be considered is the inter-rater reliability of an instrument. Inter-rater reliability assesses the consistency of responses when two or more respondents complete a questionnaire (McColl, 2005). It can be particularly relevant in the context of assessing the impact of parental ill health on children, where both parents complete the questionnaire. It can also be used to assess agreement between child and parent responses. The kappa statistic, for which a value of 0.70 or higher is commonly considered to indicate acceptable agreement, is most frequently used (Eiser and Morse, 2001).
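In practice, kappa for two respondents answering the same items can be computed directly; a sketch using scikit-learn’s cohen_kappa_score, with invented parent and child responses to ten questionnaire items:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical responses from two respondents (e.g. parent and child)
# to the same ten questionnaire items, coded on a 3-point scale.
respondent_a = [1, 2, 2, 3, 1, 2, 3, 3, 1, 2]
respondent_b = [1, 2, 3, 3, 1, 2, 3, 2, 1, 2]

kappa = cohen_kappa_score(respondent_a, respondent_b)
print(f"kappa = {kappa:.2f}, acceptable (>= 0.70): {kappa >= 0.70}")
```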
Time use and physical activity in a specialised brain injury rehabilitation unit: an observational study
Published in Brain Injury, 2018
Leanne Hassett, Siobhan Wong, Emma Sheaves, Maysaa Daher, Andrew Grady, Cara Egan, Carol Seeto, Talia Hosking, Anne Moseley
Activities were coded as therapeutic if they were completed with therapists within scheduled or unscheduled therapy sessions. Activities were also coded as therapeutic if they occurred outside therapy sessions but were clearly part of the participants’ therapy programme (e.g. walking with a walking aid/assistance, actively involved in self-care, eating with some assistance, or using a practice sheet or therapy equipment). Eight of the nine observers had the reliability of their observations checked in the week prior to data collection; one observer was omitted due to annual leave. The first author and each individual observer followed the pre-determined route together and conducted one set of observations independently for each participant. The Kappa statistic was used to assess the level of agreement between the two raters (13). The level of agreement across categories ranged from “substantial” (Kappa ≥ 0.61) for body position, therapeutic activity and non-therapeutic activity (Kappa = 0.647, 0.742 and 0.802 respectively; p < 0.001) to “almost perfect” (Kappa ≥ 0.81) for people present, location, session type and session structure (Kappa = 0.935, 0.974, 1.00 and 1.00 respectively; p < 0.001). These results were reported back to the observers prior to the commencement of data collection, and areas of discrepancy (e.g. body position) were clarified.
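A hedged sketch of this kind of per-category reliability check between two raters (the category names follow the study, but the paired codes below are invented for illustration):

```python
from sklearn.metrics import cohen_kappa_score

# Invented paired observations: for each coded category, the codes
# assigned independently by the first author and by one observer.
paired_codes = {
    "body position": (["lying", "sitting", "standing", "sitting"],
                      ["lying", "sitting", "sitting",  "sitting"]),
    "location":      (["gym", "ward", "gym", "ward"],
                      ["gym", "ward", "gym", "ward"]),
}

for category, (rater1, rater2) in paired_codes.items():
    kappa = cohen_kappa_score(rater1, rater2)
    if kappa >= 0.81:
        label = "almost perfect"
    elif kappa >= 0.61:
        label = "substantial"
    else:
        label = "below substantial"
    print(f"{category}: kappa = {kappa:.3f} ({label})")
```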
Diagnosing type 2 myocardial infarction in clinical routine. A validation study
Published in Scandinavian Cardiovascular Journal, 2019
Anton Gard, Bertil Lindahl, Gorav Batra, Marcus Hjort, Karolina Szummer, Tomasz Baron
The kappa statistic is seldom used in diagnosis validation studies, although it has been proposed for diagnosis validation in Swedish quality registers [15]. Its advantage is that it corrects for chance agreement [8]. In the present study, there was a moderate level of agreement (κ = 0.43) on the type 2 MI diagnosis between the SWEDEHEART registry and the gold standard classification. Better agreement would be desirable. However, the few existing clinical criteria for a type 2 MI diagnosis are not very specific [2], and publications on type 2 MI are inconsistent in how they define it [11,16,17], signaling that this classification is also disputed within the research community. Further, a moderate level of agreement was also seen between the gold standard reviewers in the present study, even though they were specially trained to follow the MI classification presented in the Third Universal Definition of Myocardial Infarction. This indicates that these criteria are open to interpretation and very challenging to apply in clinical routine.
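Because the usefulness of a point estimate such as κ = 0.43 depends on its precision, a confidence interval is often reported alongside it. A sketch using statsmodels on a registry-versus-gold-standard 2×2 table (the cell counts below are invented; they only roughly mimic the level of agreement discussed above):

```python
from statsmodels.stats.inter_rater import cohens_kappa

# Hypothetical 2x2 table: rows = registry diagnosis (type 2 MI yes/no),
# columns = gold standard classification (yes/no). Counts are invented.
table = [[38, 37],
         [32, 393]]

result = cohens_kappa(table)
print(f"kappa = {result.kappa:.2f}")
print(f"95% CI: ({result.kappa_low:.2f}, {result.kappa_upp:.2f})")
```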
Diagnostic accuracy of the triglyceride-glucose index for gestational diabetes screening: a practical approach
Published in Gynecological Endocrinology, 2020
Adriana Sánchez-García, René Rodríguez-Gutiérrez, Donato Saldívar-Rodríguez, Abel Guzmán-López, Carolina Castillo-Castro, Leonardo Mancillas-Adame, Karla Santos-Santillana, Victoria González-Nava, José Gerardo González-González
We selected a pre-specified sensitivity for screening diagnostic tests [19]. To determine the sample size, we assumed a sensitivity of 90% for the TyG index in identifying GDM, a 95% confidence level, a significance level of α = 0.05, and a power of 97.5%. The estimated sample size was 140 women. Categorical variables were reported as frequencies and percentages; continuous variables as means and standard deviations. An unpaired Student’s t-test or Mann-Whitney U test was used to compare continuous variables. Inter-rater agreement was estimated using the kappa statistic. We estimated the ROC curve and determined a cutoff value and the area under the curve (AUC). To establish the risk of GDM, subjects were categorized into tertiles of the TyG index distribution, with the lowest tertile used as the reference. A p value ≤ .05 was considered statistically significant, and IBM SPSS version 22.0 (IBM Corp., Armonk, NY, USA) was used to perform the analysis.
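A minimal sketch of the ROC-based part of such an analysis with scikit-learn, on synthetic data (the study used measured TyG values, and it does not state its cutoff criterion, so Youden’s J index is used here purely as an illustrative choice):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)

# Synthetic TyG index values: 100 women without GDM, 40 with GDM
# (140 in total, matching the estimated sample size above).
tyg = np.concatenate([rng.normal(8.4, 0.4, 100),   # no GDM
                      rng.normal(8.9, 0.4, 40)])   # GDM
gdm = np.concatenate([np.zeros(100), np.ones(40)])

auc = roc_auc_score(gdm, tyg)
fpr, tpr, thresholds = roc_curve(gdm, tyg)

# Illustrative cutoff: maximize Youden's J = sensitivity + specificity - 1.
best = np.argmax(tpr - fpr)
print(f"AUC = {auc:.2f}")
print(f"cutoff = {thresholds[best]:.2f} "
      f"(sensitivity {tpr[best]:.2f}, specificity {1 - fpr[best]:.2f})")
```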
Related Knowledge Centers
- Inter-Rater Reliability
- P-Value
- Scott's Pi
- Fleiss' Kappa
- Youden's J Statistic
- Intraclass Correlation
- Krippendorff's Alpha