Assessing Pain Among Oncology and Terminally Ill Patients: Psychometric Considerations
David M. Dush, Barrie R. Cassileth, Dennis C. Turk in Psychosocial Assessment in Terminal Care, 2014
If we reported that two independent raters showed an average of 80% agreement in rating each of the PPSI pain behaviors, could one conclude that the PPSI pain behavior scale is reliable? The percentage of agreement among independent raters, computed by dividing the total number of inter-rater agreements by the total number of possible ratings, is also not an acceptable method of computing inter-rater reliability. This method does not take chance agreement into account, and there is no statistical test of significance; that is, there is no way to determine statistically what percentage is high enough to be acceptable. Two better methods of establishing inter-rater reliability are to compute either a weighted kappa coefficient or an intraclass correlation. Both of these methods adjust for chance agreement and provide a statistical test of significance (see Bartko & Carpenter, 1976, for methods that can be used to calculate these reliability coefficients).
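To make the distinction concrete, the minimal sketch below contrasts raw percent agreement with Cohen's (weighted) kappa for two raters. The ratings and the three scoring categories (0 = absent, 1 = mild, 2 = severe) are invented for illustration; they are not PPSI data.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical ratings of ten pain behaviors by two independent raters
# (0 = absent, 1 = mild, 2 = severe); not actual PPSI data.
rater_a = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 2])
rater_b = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2, 2])

percent_agreement = np.mean(rater_a == rater_b)  # 0.80, but ignores chance
kappa = cohen_kappa_score(rater_a, rater_b)      # corrects for chance agreement
weighted_kappa = cohen_kappa_score(
    rater_a, rater_b, weights="quadratic")       # also credits near-misses

print(f"Percent agreement: {percent_agreement:.2f}")
print(f"Cohen's kappa:     {kappa:.2f}")
print(f"Weighted kappa:    {weighted_kappa:.2f}")
```

Because most ratings fall in the "absent" category, the 80% raw agreement here (kappa comes out near 0.62) overstates reliability relative to the chance-corrected coefficients.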
The dependent variable
Robyn L. Tate, Michael Perdices in Single-Case Experimental Designs for Clinical Research and Neurorehabilitation Settings, 2019
Of course, as Kahng and colleagues (2011) point out, high inter-observer agreement does not necessarily mean that the recorded score is correct, because it is possible that both raters were wrong. But certainly, agreement below 80% should be followed up to ascertain the reason for the low inter-observer agreement and to identify procedures that can be implemented to improve it. Inter-rater reliability can also be evaluated using the intraclass correlation coefficient (ICC) for continuous data and the kappa/weighted kappa (κ/κw) statistic for categorical data. The advantage of these techniques is that they take into account the magnitude of disagreement and rank order (ICC/κw) as well as agreement expected by chance (κ/κw). This is particularly important with behaviours occurring at the frequency extremes (very low or high occurrence), as the sketch below illustrates.
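The following minimal sketch assumes hypothetical interval recordings of a rarely occurring behaviour (the data are invented, not from Kahng et al., 2011):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# 50 observation intervals; behaviour is rare (1 = occurred, 0 = absent).
rater_1 = np.zeros(50, dtype=int)
rater_2 = np.zeros(50, dtype=int)
rater_1[[3, 17, 42]] = 1   # rater 1 scores three occurrences
rater_2[[8, 25, 31]] = 1   # rater 2 scores three different occurrences

print(np.mean(rater_1 == rater_2))          # 0.88 -- looks acceptable
print(cohen_kappa_score(rater_1, rater_2))  # ~ -0.06 -- no better than chance
```

Percent agreement is 88% simply because both raters mostly record "absent"; kappa is slightly negative, revealing that the raters never agree on an actual occurrence.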
The Efficacy and Safety of MMECT – Seizure Parameters
Barry M. Maletzky, C. Conrad Carter, James L. Fling in Multiple-Monitored Electroconvulsive Therapy, 2019
The availability of the MMECTA has enabled us to accumulate seizure duration data on all of our MMECT patients and to relate these data to therapeutic response.* Treatment methods have been described earlier; assessment was, by nature, retrospective and global, although BHS and WT scores were available for 147 depressed patients, and BPRS scores were also available for 21 schizophrenic patients. Data analyses for all patients included a global scale from −1 (worse) to +4 (total improvement) rated by ward nursing staff. Interrater reliability, assessed separately, was 89.2%. Since patients were also receiving psychotherapy and medications, no absolute statements regarding the efficacy of MMECT can be made, though in almost all cases these treatments had made little impact at the time ECT was initiated.
Cross-cultural adaptation and reliability of the Brazilian version of the wheelchair skills test-questionnaire 4.3 for manual wheelchair users
Published in Assistive Technology, 2022
Lays Cleria Batista Campos, Camila Caminha Caro, Emerson Fachin-Martins, Daniel Marinho Cezar Da Cruz
Test-retest is a very accurate estimate of an instrument's reliability, provided the construct being measured is stable over time. It is generally indicative of reliability in situations where raters are not involved or the rater effect can be neglected, such as self-reported survey instruments (Stolarova, Wolf, Rinker, & Brielmann, 2014). On the other hand, while inter-rater reliability reflects the variation between two or more raters who measure the same group of subjects, it may also reflect the accuracy of the rating process (Stolarova et al., 2014). Given the subjective nature of the qualitative variable generated by the WSTQ-M-WCU 4.3 score, we did not replicate records from the raters during the test to check intra-rater reliability, working on the assumption that each rater applies their own scoring criteria consistently. We therefore used the index of agreement to confirm the cross-cultural adaptation, and the inter-rater and test-retest reliabilities to verify, respectively, how consistently different individuals were scored by the WSTQ-M-WCU 4.3 and whether scores changed after the one-week interval.
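As a rough illustration of how such reliabilities might be estimated, the sketch below computes an intraclass correlation with the pingouin package on invented long-format scores; the subjects, raters, and values are hypothetical, not WSTQ-M-WCU 4.3 data.

```python
import pandas as pd
import pingouin as pg

# Hypothetical long-format data: five subjects each scored by two raters.
data = pd.DataFrame({
    "subject": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "rater":   ["A", "B"] * 5,
    "score":   [78, 75, 62, 65, 90, 88, 55, 58, 71, 70],
})

icc = pg.intraclass_corr(data=data, targets="subject",
                         raters="rater", ratings="score")
# ICC2 (two-way random effects, absolute agreement) is a common choice
# when raters are treated as a random sample of possible raters.
print(icc.set_index("Type").loc["ICC2", ["ICC", "CI95%"]])
```

The same call estimates test-retest reliability if the rater column is replaced by a test/retest session column, with each session treated as a "rater."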
Brief intervention targeting letter sounds, letter naming, and segmenting frequency skills show promise for improving spelling accuracy
Published in Evidence-Based Communication Assessment and Intervention, 2020
While this study has many methodological strengths as it relates to the measurement of the dependent variable, the lack of measurement by multiple experimenters was perplexing. The authors set out to provide evidence of the reliability of data measurement: the graduate assistant recorded data during each session and again later using audio and video recordings, and her secondary measurement matched her live measurement for 100% of sessions. However, the same graduate assistant recorded the dependent variable both in-session and again from the recordings. This does not meet the WWC standard that data be collected by a second, independent experimenter for at least 20% of data points in each condition (a check of this criterion is sketched below). The purpose of interrater reliability is to measure the extent to which two independent experimenters record the same data while observing the same behavior. The benefits of this approach are lost when the same experimenter records data twice because, by nature, those two observational measurements cannot be independent of one another.
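To make the WWC criterion concrete, a minimal sketch, assuming hypothetical session records, can verify that an independent second observer scored at least 20% of data points in each condition:

```python
from collections import defaultdict

# (condition, independently_double_scored) per session; data are invented.
sessions = [
    ("baseline", True), ("baseline", False), ("baseline", False),
    ("intervention", True), ("intervention", True),
    ("intervention", False), ("intervention", False),
    ("intervention", False), ("intervention", False),
]

counts = defaultdict(lambda: [0, 0])  # condition -> [double_scored, total]
for condition, double_scored in sessions:
    counts[condition][0] += int(double_scored)
    counts[condition][1] += 1

for condition, (double_scored, total) in counts.items():
    share = double_scored / total
    status = "meets" if share >= 0.20 else "fails"
    print(f"{condition}: {share:.0%} double-scored ({status} the 20% standard)")
```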
Coverage of the opioid crisis in national network television news from 2000–2020: A content analysis
Published in Substance Abuse, 2022
Jessica Jay, Amy Chan, George Gayed, Julie Patterson
A random probability sample of 7% (15/209) of transcripts was triple-coded to pilot test and refine the final coding instrument. The authors then double-coded all video segments using the coding instrument. Specifically, two sets of two authors (J.J. and A.C./J.P. and G.G.) each coded a randomly selected half of all videos. When necessary, the authors annotated the coding instrument with specific characteristics of the video segment. Interrater reliability was calculated using Cohen's kappa. As a quantitative measure of agreement, a Cohen's kappa value of < 0.5 indicated poor reliability; values of 0.5–0.75 indicated moderate reliability; values of 0.75–0.9 indicated good reliability; and values > 0.90 indicated excellent reliability. We established 0.75 as the minimum acceptable cutoff for interrater reliability for this study [40]. Interrater reliability of included variables was high, ranging from 0.77 to 1. Before the video segment characteristics were analyzed, any differences in coding were discussed by the authors and resolved by consensus.
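A minimal sketch of this reliability workflow, computing Cohen's kappa per variable and interpreting it against the bands quoted above, might look as follows; the codes are hypothetical, not drawn from the actual transcripts.

```python
from sklearn.metrics import cohen_kappa_score

def interpret_kappa(k: float) -> str:
    """Band labels quoted in the study; the 0.75 boundary is treated as
    'good' to match the study's minimum acceptable cutoff."""
    if k < 0.5:
        return "poor"
    if k < 0.75:
        return "moderate"
    if k <= 0.90:
        return "good"
    return "excellent"

# Hypothetical binary codes (e.g., "segment mentions treatment") from
# two coders across ten video segments.
coder_1 = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
coder_2 = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]

k = cohen_kappa_score(coder_1, coder_2)
print(f"kappa = {k:.2f} ({interpret_kappa(k)})")
assert k >= 0.75, "below the study's minimum acceptable cutoff"
```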
Related Knowledge Centers
- Cohen's Kappa
- Scott's Pi
- Fleiss' Kappa
- Concordance Correlation Coefficient
- Intraclass Correlation
- Krippendorff's Alpha
- Pearson Correlation Coefficient
- Kendall Rank Correlation Coefficient
- Spearman's Rank Correlation Coefficient
- Standard Deviation