Research Methods and Statistics
Published in Monica Martinussen, David R. Hunter, Aviation Psychology and Human Factors, 2017
Monica Martinussen, David R. Hunter
Of course, we can divide the test into two parts in many ways, depending on how the questions or items are split. One form of reliability that is frequently used for personality traits is Cronbach's alpha, which can be thought of as the average of all possible split-half reliability estimates for a given measure. The various types of reliability provide different types of information about the test scores. Test–retest reliability says something about stability over time, whereas the other forms (split-half and Cronbach's alpha) provide information about internal consistency. The calculated correlations should be as high as possible, preferably 0.70 or higher, but sometimes lower values may be accepted (see, e.g., EFPA 2013). One factor affecting test reliability is the number of questions: the more questions there are, the higher the reliability. In addition, it is important that the test conditions and scoring procedures be standardized, meaning that clear and well-defined procedures are used for all subjects. Sometimes the scoring of a test requires some judgment, and then inter-rater reliability should be assessed, for example, by having two or more raters score the same sample of subjects and then assessing the degree of consistency between them.
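As a rough illustration of how such an internal-consistency estimate can be computed, the following Python sketch calculates Cronbach's alpha from a respondents-by-items score matrix using the standard variance-based formula (the item scores shown are hypothetical and not taken from the chapter):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a (respondents x items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]                          # number of items (questions)
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of the total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical data: 6 respondents answering 4 Likert-type items
scores = np.array([
    [4, 5, 4, 4],
    [2, 3, 2, 3],
    [5, 5, 4, 5],
    [3, 3, 3, 2],
    [4, 4, 5, 4],
    [1, 2, 2, 1],
])
print(f"Cronbach's alpha = {cronbach_alpha(scores):.2f}")
```

Adding more items with similar inter-item correlations increases the number of items k in the formula, which is one way to see why longer tests tend to yield higher reliability estimates.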
Assessment
Published in Suzanne K. Kearns, Timothy J. Mavin, Steven Hodge, Competency-Based Education in Aviation, 2017
Suzanne K. Kearns, Timothy J. Mavin, Steven Hodge
Baker and Dismukes (1999) identified four methods that could be used to improve inter-rater reliability: rater-error training, performance dimension training, behavioral-observation training and frame-of-reference training. Rater-error training involves explaining to the assessor the different errors that can be made during assessment, such as the halo effect, horn effect, central tendency, leniency, severity, primacy and recency (Baker and Dismukes 1999; Woehr and Huffcutt 1994). Horn and halo effects occur when the final performance assessment is disproportionately affected by evidence of poor or good performance that is small relative to the entire assessment; for example, an overall good performance being marked down heavily because of one mistake. Leniency and severity occur when the assessor provides disproportionately high or low marks overall. Pilots have been heard to make comments such as, "I like being checked by Fred, he is an easy marker," and training managers sometimes have an issue with pilots calling in sick for simulator assessments because of a check captain who has a reputation as a strict marker. In contrast, central tendency occurs when the assessor tends to use the middle grades; that is, they rarely fail anyone or award exemplary marks. Finally, primacy and recency arise when the assessor is influenced by how an individual begins or finishes an assessment (Baker and Dismukes 1999).
Trust in International Military Missions: Violations of Trust and Strategies for Repair
Published in Neville A. Stanton, Trust in Military Teams, 2011
Ritu Gill, Megan M. Thompson, Angela R. Febbraro
We also explored participants’ recommendations for trust restoration in this context and whether these spontaneously generated recommendations would reliably reflect the four major theoretical dimensions of trust (competence, integrity, benevolence, and predictability). Sentences in the short-answer responses were coded by two independent coders using a nine-category coding scheme (i.e., positive and negative valence for each of the four trust dimensions, plus a miscellaneous category) in NVivo8 (QSR International’s NVivo8, 2008), a qualitative data-analysis software package. Sentences that referred to more than one theme (e.g., to both benevolence and competence) could be coded as reflecting more than one category. The inter-rater reliability (overall mean kappa) was 0.76, which is considered excellent agreement (Capozzoli, McSweeney, and Sinha, 1999), and the percent agreement between the raters was 95.4 percent.
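As a minimal illustration of this kind of two-coder agreement check (the category labels and codings below are invented and not the study's data), Cohen's kappa and percent agreement for two coders can be computed with scikit-learn as follows:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical codings by two independent coders (one label per sentence)
coder_a = ["competence+", "integrity-", "benevolence+", "misc", "competence+", "predictability-"]
coder_b = ["competence+", "integrity-", "benevolence+", "misc", "integrity-", "predictability-"]

kappa = cohen_kappa_score(coder_a, coder_b)
percent_agreement = np.mean([a == b for a, b in zip(coder_a, coder_b)]) * 100

print(f"Cohen's kappa: {kappa:.2f}")
print(f"Percent agreement: {percent_agreement:.1f}%")
```

Kappa is typically reported alongside percent agreement because it discounts the agreement that would be expected by chance alone.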
Digital Transformation for Agility and Resilience: An Exploratory Study
Published in Journal of Computer Information Systems, 2023
George Mangalaraj, Sridhar Nerur, Rahul Dwivedi
To answer the second question on the types of IT and their role, we closely examined the IT-related words for similarities and dissimilarities among them. In total, there were 1,873 IT-related word counts with 60 unique words. A closer examination of the IT-related words revealed that their usage varied with the organization and context. For example, IT-related word usage differed in some cases for similar concepts (e.g., online, e-commerce, web), similar types of technology (e.g., CRM systems, price management systems), or similar types of techniques (e.g., data analytics, AI). Hence, two authors independently coded the identified 1,873 words into seven IT categories for further analysis. To ascertain inter-rater reliability, Cohen’s kappa coefficient was computed and found to be 0.92, well above the recommended 0.70 [32]. Subsequently, the two coders discussed and resolved the disagreements. Table 3 presents the IT categories and their counts.
Advancing natural language processing (NLP) applications of morphologically rich languages with bidirectional encoder representations from transformers (BERT): an empirical case study for Turkish
Published in Automatika, 2021
Akın Özçift, Kamil Akarsu, Fatma Yumuk, Cevhernur Söylemez
Kappa, i.e. κ, is a statistical metric that measures inter-rater reliability for categorical items. Inter-rater reliability is defined as the degree of agreement among the raters. This statistical score measures the degree of consensus based on the decisions of the predictors. In other words, it measures the agreement between two predictors who each classify N items into mutually exclusive categories, and κ is defined as

\[
\kappa = \frac{p_o - p_e}{1 - p_e}, \tag{7}
\]

where \(p_o\) is the observed relative agreement between the two predictors, which is identical to accuracy. In Equation (7), \(p_e\) is the expected chance agreement, obtained from \(n_{ki}\), the number of times rater \(i\) predicted category \(k\) over the \(N\) observations:

\[
p_e = \frac{1}{N^2} \sum_{k} n_{k1}\, n_{k2},
\]

which, in terms of confusion-matrix entries for a binary problem, can also be calculated as \(p_e = \big[(TP+FN)(TP+FP) + (FP+TN)(FN+TN)\big]/N^{2}\). The overall score of κ varies between −1 and 1. The obtained score is a statistical measure of how far the obtained results are from occurring by chance. More empirically, values smaller than 0.40 indicate fair agreement, while values between 0.40 and 0.60 indicate moderate agreement. Simply, for a confident evaluation of classification performance, we should obtain 0.60–0.80 for good agreement and higher than 0.80 for perfect agreement [51]. Therefore, we calculated the κ metric in the experiments to evaluate the statistical confidence of the obtained results.
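A minimal sketch of this calculation for a binary confusion matrix (the TP/FP/FN/TN counts below are invented for illustration):

```python
def cohen_kappa(tp: int, fp: int, fn: int, tn: int) -> float:
    """Cohen's kappa from binary confusion-matrix counts.

    p_o is the observed agreement (identical to accuracy);
    p_e is the agreement expected by chance, computed from the
    marginal totals of the two raters/predictors.
    """
    n = tp + fp + fn + tn
    p_o = (tp + tn) / n
    p_e = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / n**2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical counts for two predictors classifying N = 100 items
print(f"kappa = {cohen_kappa(tp=40, fp=10, fn=5, tn=45):.2f}")
```

With these counts, \(p_o = 0.85\) and \(p_e = 0.50\), giving κ = 0.70, which falls in the "good agreement" band described above.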
Analysing learning outcomes in an Electrical Engineering curriculum using illustrative verbs derived from Bloom’s Taxonomy
Published in European Journal of Engineering Education, 2018
Lawrence Meda, Arthur James Swart
The rating/scoring was done by the authors, who have published numerous articles focusing on learning outcomes and illustrative verbs taken from Bloom’s Taxonomy, the most recent of which discusses the importance of setting clear, observable and measurable learning outcomes (Swart 2014). The authors have also facilitated numerous workshops at international conferences since 2008 on the importance of academics maintaining a proper balance between higher-order and lower-order questions using Bloom’s Taxonomy. Both authors individually rated the learning outcomes in the various study guides, recording the number of well-structured or poorly structured learning outcomes in an Excel sheet, after which an inter-rater reliability score was established. Inter-rater reliability is a measure of reliability employed to assess the degree to which different judges or raters agree in their assessment decisions (Ghadi 2016). Inter-rater reliability coefficients should be between 0.7 and 0.9, depending on the importance of the decision-making process (Kelly 1927). In this study, an inter-rater reliability score of 91% was attained (Pearson correlation of 0.911). No bias was evident in the rating/scoring of the outcomes.
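For illustration only (the counts below are invented and not the study's data), such a correlation-based inter-rater reliability score could be computed with SciPy as follows:

```python
from scipy.stats import pearsonr

# Hypothetical counts of well-structured learning outcomes recorded by
# each author for the same eight study guides
rater_1 = [12, 8, 15, 6, 9, 11, 7, 14]
rater_2 = [11, 9, 15, 5, 9, 12, 7, 13]

r, p_value = pearsonr(rater_1, rater_2)
print(f"Pearson correlation between raters: r = {r:.3f} (p = {p_value:.4f})")
```

Note that a correlation-based estimate reflects how consistently the two raters rank the study guides rather than exact agreement on the counts, which is why kappa-type statistics are often preferred for categorical judgments.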