A hierarchical structure model of success factors for (blockchain-based) crowdfunding
Published in Blockchain and Web 3.0 (edited by Massimo Ragnedda and Giuseppe Destefanis), 2019
Felix Hartmann, Xiaofeng Wang, Maria Ilaria Lunesu
The responses from the four experts were checked for inter-rater agreement using Fleiss’ kappa (which is suitable for nominal categorical data from more than two raters), and the result showed only slight agreement among the experts (kappa = 0.134).
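As an illustration of this kind of check (a minimal sketch, not the authors' actual code), the snippet below computes Fleiss' kappa for four raters assigning items to nominal categories using statsmodels; the items and category codes are invented for the example.

```python
# Minimal sketch, assuming statsmodels is installed; the ratings matrix is
# invented and only illustrates the data shape (items x raters, nominal codes).
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# rows = items being classified, columns = the four experts,
# values = the nominal category code each expert assigned
ratings = np.array([
    [0, 0, 1, 2],
    [1, 1, 1, 0],
    [2, 0, 2, 2],
    [0, 1, 2, 1],
    [1, 1, 0, 1],
])

table, _ = aggregate_raters(ratings)   # items x categories count table
kappa = fleiss_kappa(table)            # overall chance-corrected agreement
print(f"Fleiss' kappa = {kappa:.3f}")  # values <= 0.20 read as slight agreement
```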
Risk investigation in circular economy: a hierarchical decision model approach
Published in International Journal of Logistics Research and Applications, 2021
For classification systems, various categorical scales can be used to verify consistency among the experts. Cohen (1960) initially developed a statistical measure, the kappa statistic, to quantify consistency when there are only two raters. Kappa (k) is an index of agreement for use with nominal scales, analogous to an alternate-form reliability coefficient for continuous data (Schippmann, Prien, and Hughes 1991). Fleiss (1981) later extended this agreement measure to m raters assigning n items to C categories; the extension is known as Fleiss’ kappa. The idea is to assess whether the raters assign each item to the same category or to different categories. The degree of consensus among the experts indicates the consistency of their ratings: if consensus is high, the ratings can be taken to reflect the actual circumstances. This makes it possible to measure the degree of agreement among the several experts who participated in the session.
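To make the m-raters, n-items, C-categories description concrete, here is a short from-scratch Python sketch of the Fleiss' kappa computation (a generic illustration, not code from the article); the count matrix is a toy example.

```python
# Fleiss' kappa from an n x C count matrix, where entry (i, j) is the number
# of the m raters who assigned item i to category j (generic sketch).
import numpy as np

def fleiss_kappa(counts):
    counts = np.asarray(counts, dtype=float)
    n, C = counts.shape                      # n items, C categories
    m = counts[0].sum()                      # raters per item (assumed equal)
    # per-item observed agreement P_i
    P_i = (np.sum(counts**2, axis=1) - m) / (m * (m - 1))
    P_bar = P_i.mean()                       # mean observed agreement
    p_j = counts.sum(axis=0) / (n * m)       # overall category proportions
    P_e = np.sum(p_j**2)                     # agreement expected by chance
    return (P_bar - P_e) / (1 - P_e)

# toy example: 5 items, 4 raters, 3 categories
counts = [[2, 1, 1], [0, 4, 0], [1, 1, 2], [3, 1, 0], [0, 3, 1]]
print(round(fleiss_kappa(counts), 3))
```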
Automatic estimation of ulcerative colitis severity from endoscopy videos using ordinal multi-instance learning
Published in Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 2021
Evan Schwab, Gabriela Oana Cula, Kristopher Standish, Stephen S. F. Yip, Aleksandar Stojmirovic, Louis Ghanem, Christel Chehoud
The second notable limitation is the relatively high subjectivity of scoring UC severity by expert GIs. The Fleiss’ kappa (κ) metric is commonly used to assess inter-rater agreement between multiple raters, with the following interpretation scale: κ ≤ 0.20 (slight), 0.21–0.40 (fair), 0.41–0.60 (moderate), 0.61–0.80 (substantial), and 0.81–1.00 (nearly perfect). As reported in Daperno et al. (2014), the inter-rater agreement of video-level MES between 14 expert raters was moderate, with a Fleiss κ confidence interval (CI) of 0.47–0.56. Likewise, Principi et al. (2020) report a moderate Fleiss κ (CI 0.39–0.66) with 13 expert raters. A Fleiss κ (CI 0.51–0.69) was also reported between 4 expert raters for the clinical trial data (Sands et al. 2019) that served as our ground truth labels for training and validation.
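For reference, the quoted interpretation scale is easy to encode; the helper below is an illustrative mapping (not from the paper) from a kappa value to the Landis and Koch (1977) label.

```python
# Illustrative mapping from a Fleiss' kappa value to the agreement labels
# quoted above (not code from the paper).
def agreement_label(kappa: float) -> str:
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "nearly perfect"

# example values chosen to fall inside the confidence intervals reported above
for k in (0.50, 0.52, 0.60):
    print(k, agreement_label(k))
```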
Using gamification elements for competitive crowdsourcing: exploring the underlying mechanism
Published in Behaviour & Information Technology, 2021
Congcong Yang, Hua Jonathan Ye, Yuanyue Feng
We also conducted a pilot test with 40 individuals to validate the new instruments. Following the procedures introduced by Moore and Benbasat (1991), items for all constructs were tested with a two-stage Q-sorting process to enhance their content validity, convergent validity, and discriminant validity. All instruments were measured using 5-point Likert scales ranging from ‘strongly disagree’ to ‘strongly agree’ (see Table 3 for the instruments). Because we conducted the survey on a Chinese crowdsourcing platform, the original English questionnaire was translated into Chinese by six bilingual (i.e. proficient in both English and Mandarin Chinese) information systems professors. We then compared the six Mandarin Chinese versions of the questionnaire and resolved inconsistencies among them. After that, the Mandarin Chinese questionnaire was translated back into English by another six bilingual information systems professors. We then compared the six English questionnaires and checked whether they were consistent with the original English questionnaire. Where translations conflicted, we carefully compared each version and judged which was more appropriate. We also assessed inter-rater agreement by calculating Fleiss’ kappa scores. The Fleiss’ kappa scores for all of the items were well above 0.70, indicating substantial agreement among the translators (Landis and Koch 1977).
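A rough sketch of such an agreement check (an assumed workflow, not the authors' code): treat the questionnaire items as subjects, the six translators as raters, and the target constructs as categories, then compute Fleiss' kappa and compare it with the 0.70 benchmark. Item names and category codes below are made up for illustration.

```python
# Hypothetical Q-sort agreement check: six translators (raters) sort each
# questionnaire item (subject) into a construct (category). Data is invented.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

sorts = np.array([
    # translator: 1  2  3  4  5  6
    [0, 0, 0, 0, 1, 0],   # item 1 -> mostly construct 0
    [1, 1, 1, 1, 1, 1],   # item 2 -> construct 1
    [2, 2, 1, 2, 2, 2],   # item 3 -> mostly construct 2
    [0, 0, 0, 0, 0, 0],   # item 4 -> construct 0
])

table, _ = aggregate_raters(sorts)
kappa = fleiss_kappa(table)
print(f"kappa = {kappa:.2f}, above 0.70: {kappa > 0.70}")
```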