Catalogue Search | MBRL
Search Results
Explore the vast range of titles available.
317 result(s) for "Inter-rater reliability"
Calibration of multisite raters for prospective visual reads of amyloid PET scans
by Zeineh, Michael; Koran, Mary Ellen I.; Windon, Charles C.
in Alzheimer Disease - diagnostic imaging; Amyloid - metabolism; Amyloid positron emission tomography
2025
INTRODUCTION: In multicenter Alzheimer's disease studies, amyloid positron emission tomography (PET) visual reads are typically performed centrally by a few experts. Incorporating a broader reader network enhances scalability and generalizability.
METHODS: Ten neuroimaging experts from eight Alzheimer's Disease Research Centers (ADRCs) visually read 180 amyloid PET scans (30 scans and 15 duplicate scans for each of four tracers, imaged across a wide variety of scanners), using preferred reading software without anatomical imaging or quantitation. Scans were classified as elevated or non-elevated per tracer-specific criteria. Inter- and intra-rater agreement was assessed.
RESULTS: Inter-rater agreement was substantial (Fleiss' κ = 0.78), with full consensus on 69% of scans. Inter-rater reliability was substantial to perfect across tracers (Fleiss' κ = 0.70–0.87). Intra-rater agreement was substantial to perfect (Cohen's κ = 0.79–1). Scans with intermediate (10–40 Centiloid) quantitation had lower reader agreement.
DISCUSSION: A multicenter expert network achieved substantial agreement classifying amyloid PET scans. These scans provide a standard for reader training and reliability assurance in future studies.
Highlights: Calibration methods ensure reliable amyloid positron emission tomography (PET) visual reads across multiple raters. Substantial agreement is possible across readers using their preferred tools. Agreement is also substantial regardless of the amyloid PET tracer used. Scans with intermediate (10–40 Centiloid) quantitation have lower reader agreement. The calibration set will become a training tool for amyloid PET visual read studies.
Journal Article
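As an illustrative aside (not part of the article above), the sketch below computes Fleiss' kappa, the multi-rater agreement coefficient reported in this entry, from a subjects-by-categories count matrix. The rating matrix and the two-category scheme (elevated / non-elevated) are synthetic stand-ins, not the study's data.

```python
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """counts[i, j] = number of raters assigning subject i to category j.
    Every row must sum to the same number of raters n."""
    N, _ = counts.shape
    n = counts[0].sum()
    p_j = counts.sum(axis=0) / (N * n)                         # overall category proportions
    P_i = (np.square(counts).sum(axis=1) - n) / (n * (n - 1))  # per-subject agreement
    P_bar, Pe_bar = P_i.mean(), np.square(p_j).sum()
    return (P_bar - Pe_bar) / (1 - Pe_bar)

# Synthetic example: 6 scans, 10 raters, columns = [elevated, non-elevated]
ratings = np.array([[10, 0], [9, 1], [2, 8], [0, 10], [7, 3], [10, 0]])
print(round(fleiss_kappa(ratings), 2))  # about 0.63 for this toy matrix
```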
Inter‐rater reliability and clinical relevance of subjective and objective interpretation of videofluoroscopy findings
2024
Background: Dysphagia is commonly evaluated using videofluoroscopy (VFS). As its ratings are usually subjective normal-abnormal ratings, objective measurements have been developed. We compared the inter-rater reliability of the usual VFS ratings to the objective measurement VFS ratings and evaluated their clinical relevance.
Methods: Two blinded raters analyzed the subjective normal-abnormal ratings of 77 patients' VFS. Two other blinded raters analyzed the objective measurements of pharyngeal aerated area with bolus held in the oral cavity (PAhold), the pharyngeal area of residual bolus during swallowing (PAmax), the pharyngeal constriction ratio (PCR), the maximum pharyngoesophageal segment opening (PESmax), pharyngoesophageal segment opening duration (POD), airway closure duration (ACD), and total pharyngeal transit time (TPT). We evaluated the inter-rater agreement in the subjective ratings and the objective measurements. Clinical utility analysis compared the measurements with the VFS findings of pharyngeal phase abnormality, penetration/aspiration, and cricopharyngeal relaxation.
Results: In the pharyngeal findings, the subjective analysis inter-rater agreement was mainly moderate to strong. The strongest agreements were on the pharyngeal residues and penetration/aspiration findings. The objective measurements had fair to good inter-rater agreement. Clinical utility analysis found statistically significant connections between TPT and pharyngeal phase abnormality, normal PCR and lack of penetration/aspiration, and normal PESmax and normal cricopharyngeal relaxation.
Conclusions: The subjective analysis had moderate to strong inter-rater agreement in the pharyngeal VFS findings, especially concerning pharyngeal residues and penetration/aspiration detection, reflecting the efficacy and safety of swallowing. The objective measurements had fair to good inter-observer reproducibility and could thus improve the reliability of VFS diagnostics. Level of evidence: 4.
Subjective normal-abnormal analysis of videofluoroscopy is highly reproducible, especially in the detection of swallowing efficacy and safety. However, objective measurements could further improve the reliability of videofluoroscopy diagnostics.
Journal Article
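As an illustrative aside, the snippet below shows how two-rater agreement on subjective normal/abnormal ratings of the kind described above is commonly quantified with Cohen's kappa. The ratings are synthetic and `sklearn.metrics.cohen_kappa_score` is used for the computation; the abstract itself does not name the coefficient the authors applied.

```python
from sklearn.metrics import cohen_kappa_score

# Synthetic two-rater normal/abnormal ratings for a handful of examinations
rater_a = ["abnormal", "normal", "abnormal", "normal", "normal", "abnormal", "normal"]
rater_b = ["abnormal", "normal", "normal",   "normal", "normal", "abnormal", "normal"]

print(round(cohen_kappa_score(rater_a, rater_b), 2))  # chance-corrected agreement
```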
Remote sensing‐based mapping of structural building damage in the Ahr valley
by Samprogna Mohor, Guilherme; Koch, Oliver; Sieg, Tobias
in Automation; Building damage; Buildings
2025
Flood damage data are needed for various applications. Structural damage to buildings reflects not only the economic damage but also the life-threatening condition of a building, providing crucial information for disaster response and recovery. Since traditional on-site data collection shortly after a disaster is challenging, remote sensing data can be of great help: they cover a wider area and can be deployed earlier than on-site surveys. However, this approach has its own challenges and limitations, which we illustrate with two case studies of flash floods in Germany. First, we assessed the reliability of an existing flood damage schema, which differentiates damage grades from minor (structural) damage to complete building collapse. We compared two on-site raters of the 2016 Braunsbach flood, reaching an excellent level of reliability. Second, we mapped structural building damage after the 2021 flood in the Ahr valley using a textured 3D mesh and orthophotos. Here, we evaluated the remote sensing-based damage mapping done by three raters. Although the heterogeneity of ratings based on remote sensing data is larger than among on-site ratings, we consider the approach fit-for-purpose when compared with on-site mapping, especially for event documentation and as a basis for financial damage estimation and less complex numerical modelling.
Journal Article
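The entry above rates structural damage on an ordinal schema from minor damage to complete collapse, but the abstract does not name the agreement statistic used. As a hedged illustration only, a weighted Cohen's kappa is one common choice for such ordered grades, since it penalises near-miss ratings less than gross disagreements; the grades below are synthetic.

```python
from sklearn.metrics import cohen_kappa_score

# Synthetic ordinal damage grades (0 = no damage ... 4 = collapse) from two raters
rater_1 = [0, 1, 2, 2, 3, 4, 1, 0, 2, 3]
rater_2 = [0, 1, 2, 3, 3, 4, 1, 1, 2, 3]

print(round(cohen_kappa_score(rater_1, rater_2, weights="quadratic"), 2))
```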
Developing and testing inter‐rater reliability of a data collection tool for patient health records on end‐of‐life care of neurological patients in an acute hospital ward
by Sigurdardottir, Valgerdur; Haraldsdottir, Erna; Tryggvadottir, Gudny Bergthora
in Communication; Data collection; Disease
2023
Aim: To develop and test a data collection tool—the Neurological End-Of-Life Care Assessment Tool (NEOLCAT)—for extracting data from patient health records (PHRs) on end-of-life care of neurological patients in an acute hospital ward.
Design: Instrument development and inter-rater reliability (IRR) assessment.
Method: NEOLCAT was constructed from patient care items obtained from clinical guidelines and literature on end-of-life care. Expert clinicians reviewed the items. Using percentage agreement and Fleiss' kappa, we calculated IRR on 32 nominal items out of 76 items.
Results: The IRR of NEOLCAT showed 89% (range 83%–95%) overall categorical percentage agreement. The Fleiss' kappa categorical coefficient was 0.84 (range 0.71–0.91). There was fair or moderate agreement on six items, and moderate or almost perfect agreement on 26 items.
Conclusion: The NEOLCAT shows promising psychometric properties for studying clinical components of end-of-life care of neurological patients on an acute hospital ward but could be further developed in future studies.
Journal Article
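The NEOLCAT entry above reports both percentage agreement and Fleiss' kappa on nominal chart-review items. As an illustrative sketch only: "percentage agreement" with more than two raters can be defined in several ways, and the helper below uses mean pairwise agreement per item on synthetic codes, purely to show one common convention.

```python
from itertools import combinations

def mean_pairwise_agreement(ratings_per_item):
    """ratings_per_item: list of lists, each holding all raters' codes for one item.
    Returns the mean proportion of agreeing rater pairs, averaged over items."""
    per_item = []
    for codes in ratings_per_item:
        pairs = list(combinations(codes, 2))
        per_item.append(sum(a == b for a, b in pairs) / len(pairs))
    return sum(per_item) / len(per_item)

items = [["yes", "yes", "yes"], ["yes", "no", "yes"],
         ["no", "no", "no"], ["yes", "yes", "no"]]
print(round(mean_pairwise_agreement(items), 2))  # 0.67 for this toy data
```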
Reliability of the Garden Alignment Index and Valgus Tilt Measurement for Nondisplaced Femoral Neck Fractures
by Norio Yamamoto; Toshiyuki Matsumoto; Ryuichiro Okuda
in Agreements; femoral neck fracture; intracapsular hip fracture; Garden alignment index; posterior tilt; inter-rater reliability; intra-rater reliability; intraclass correlation coefficients
2022
Anteroposterior (AP) alignment assessment for nondisplaced femoral neck fractures is important for determining the treatment strategy and predicting postoperative outcomes. AP alignment is generally measured using the Garden alignment index (GAI). However, its reliability remains unknown. We compared the reliability of GAI and a new AP alignment measurement (valgus tilt measurement [VTM]) using preoperative AP radiographs of nondisplaced femoral neck fractures. The study was designed as an intra- and inter-rater reliability analysis. The raters were four trauma surgeons who assessed 50 images twice. The main outcome was the intraclass correlation coefficient (ICC). To calculate intra- and inter-rater reliability, we used a mixed-effects model considering rater, patient, and time. The overall ICC (95% CI) of GAI and VTM for intra-rater reliability was 0.92 (0.89–0.94) and 0.86 (0.82–0.89), respectively. The overall ICC of GAI and VTM for inter-rater reliability was 0.92 (0.89–0.95) and 0.85 (0.81–0.88), respectively. The intra- and inter-rater reliability of GAI was higher in patients aged <80 years than in patients aged ≥80 years. Our results showed that GAI is a more reliable measurement method than VTM, although both are reliable. Variations in patient age should be considered in GAI measurements.
Journal Article
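The entry above estimates intra- and inter-rater reliability with intraclass correlation coefficients from a mixed-effects model that includes rater, patient, and time. As a simplified, hedged stand-in (not the authors' model), the sketch below computes the classical two-way random-effects, single-measurement ICC(2,1) of Shrout and Fleiss from a subjects-by-raters matrix of synthetic measurements.

```python
import numpy as np

def icc2_1(x: np.ndarray) -> float:
    """x: (n_subjects, k_raters) matrix of continuous measurements.
    Two-way random-effects, absolute-agreement, single-rater ICC (Shrout & Fleiss 2,1)."""
    n, k = x.shape
    grand = x.mean()
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()   # between-subject sum of squares
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()   # between-rater sum of squares
    ss_err = ((x - grand) ** 2).sum() - ss_rows - ss_cols
    msr, msc = ss_rows / (n - 1), ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Synthetic example: 6 radiographs, each measured once by 4 raters
measurements = np.array([
    [162, 165, 160, 163],
    [175, 178, 176, 174],
    [158, 156, 159, 157],
    [181, 183, 180, 182],
    [170, 168, 171, 169],
    [166, 167, 165, 168],
], dtype=float)
print(round(icc2_1(measurements), 2))
```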
Reliability of the revised Cochrane risk-of-bias tool for randomised trials (RoB2) improved with the use of implementation instruction
2022
Objective: To assess the inter-rater reliability (IRR) of the revised Cochrane risk-of-bias tool for randomised trials (RoB2).
Methods: Four raters independently applied RoB2 to critical and important outcomes of individually randomised parallel-group trials (RCTs) included in the Cochrane Review "Cannabis and cannabinoids for people with multiple sclerosis." We calculated Fleiss' kappa for multiple raters and the time to complete the tool. We performed a calibration exercise on five studies, then developed an implementation document (ID) specific to the condition and the intervention addressed by the review, with instructions on how to answer the signalling questions of the RoB2 tool. We measured IRR before and after ID adoption.
Results: Eighty results related to seven outcomes from 16 RCTs were assessed. During the calibration exercise we reached no agreement for the overall judgement (IRR -0.15); IRR for individual domains ranged from no agreement to fair. Mean time to apply the tool was 168.5 minutes per study. Time to complete the calibration exercise and develop the ID was about 40 hours. After ID adoption, overall agreement increased to slight (IRR 0.11) for the first five studies and moderate (IRR 0.42) for the remaining 11. IRR for individual domains ranged from no agreement to almost perfect. Mean time to apply the tool decreased to 41 minutes.
Conclusions: The RoB2 tool is comprehensive but complex, even for highly experienced raters. Developing an ID specific to the review may improve reliability substantially.
Journal Article
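This entry, like several others in these results, describes kappa values with verbal labels such as "no agreement", "slight", and "moderate". As an illustrative aside, the lookup below reproduces the conventional Landis and Koch bands behind such labels; applying it to the values quoted in the abstract yields the same wording.

```python
def interpret_kappa(kappa: float) -> str:
    """Conventional Landis & Koch (1977) verbal labels for kappa-type coefficients."""
    if kappa < 0:
        return "poor (less than chance agreement)"
    for upper, label in [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
                         (0.80, "substantial"), (1.00, "almost perfect")]:
        if kappa <= upper:
            return label
    return "almost perfect"

for value in (-0.15, 0.11, 0.42, 0.78):
    print(value, "->", interpret_kappa(value))
```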
Evaluation of Inter-Rater Agreement and Inter-Rater Reliability for Observational Data: An Overview of Concepts and Methods
2015
Evaluation of inter-rater agreement (IRA) or inter-rater reliability (IRR), either as a primary or a secondary component of a study, is common in various disciplines such as medicine, psychology, education, anthropology, and marketing, where the use of raters or observers as a method of measurement is prevalent. The concept of IRA/IRR is fundamental to the design and evaluation of research instruments. However, many methods for comparing variations and statistical tests exist, and as a result there is often confusion about their appropriate use. This may lead to incomplete and inconsistent reporting of results. Consequently, a set of guidelines for reporting reliability and agreement studies has recently been developed to improve the scientific rigor with which IRA/IRR studies are conducted and reported (Gisev, Bell & Chen, 2013; Kottner, Audige, & Brorson, 2011). The objective of this technical note is to present the key concepts in relation to IRA/IRR and to describe commonly used approaches for its evaluation. The emphasis is on the practical aspects of their use in behavioral and social research rather than on the mathematical derivation of the indices.
Journal Article
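As an illustrative aside to the overview above, the worked example below contrasts raw percentage agreement with Cohen's kappa, the basic chance-correction idea such technical notes typically cover. The 2x2 cell counts are hypothetical.

```python
def agreement_from_2x2(a: int, b: int, c: int, d: int) -> tuple[float, float]:
    """a, d = concordant cells (yes/yes, no/no); b, c = discordant cells.
    Returns (observed agreement Po, Cohen's kappa)."""
    n = a + b + c + d
    po = (a + d) / n
    pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2   # chance agreement from marginals
    return po, (po - pe) / (1 - pe)

po, kappa = agreement_from_2x2(a=40, b=5, c=5, d=10)
print(round(po, 2), round(kappa, 2))   # high raw agreement, noticeably lower kappa
```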
A comparison of Cohen’s Kappa and Gwet’s AC1 when calculating inter-rater reliability coefficients: a study conducted with personality disorder samples
by Wedding, Danny; Gwet, Kilem L; Wongpakaran, Tinakon
in Agreements; Anxiety; Computer Communication Networks
2013
Background
Rater agreement is important in clinical research, and Cohen's Kappa is a widely used method for assessing inter-rater reliability; however, there are well-documented statistical problems associated with the measure. In order to assess its utility, we evaluated it against Gwet's AC1 and compared the results.
Methods
This study was carried out across 67 patients (56% male) aged 18 to 67, with a mean ± SD age of 44.13 ± 12.68 years. Nine raters (seven psychiatrists, a psychiatry resident, and a social worker) participated as interviewers, either for the first or the second interview; the two interviews were held 4 to 6 weeks apart. The interviews were held in order to establish a personality disorder (PD) diagnosis using DSM-IV criteria. Cohen's Kappa and Gwet's AC1 were used, and the level of agreement between raters was assessed in terms of a simple categorical diagnosis (i.e., the presence or absence of a disorder). Data were also compared with a previous analysis in order to evaluate the effects of trait prevalence.
Results
Gwet's AC1 was shown to have higher inter-rater reliability coefficients for all the PD criteria, ranging from .752 to 1.000, whereas Cohen's Kappa ranged from 0 to 1.00. Cohen's Kappa values were high and close to the percentage of agreement when the prevalence was high, whereas Gwet's AC1 values did not change much with a change in prevalence but remained close to the percentage of agreement. For example, a Schizoid sample yielded a mean Cohen's Kappa of .726 and a Gwet's AC1 of .853, which fall within different levels of agreement according to the criteria developed by Landis and Koch, and by Altman and Fleiss.
Conclusions
Based on the different formulae used to calculate the level of chance-corrected agreement, Gwet's AC1 was shown to provide a more stable inter-rater reliability coefficient than Cohen's Kappa. It was also found to be less affected by prevalence and marginal probability than Cohen's Kappa, and should therefore be considered for use in inter-rater reliability analysis.
Journal Article
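As an illustrative aside (hypothetical counts, not the study's data), the sketch below computes Cohen's kappa and Gwet's AC1 for two raters and a binary diagnosis directly from a 2x2 table. For a binary scale, AC1 replaces kappa's marginal-product chance term with 2π(1−π), where π is the mean of the two raters' "present" rates; the skewed table illustrates the prevalence effect the article describes, with kappa dropping while AC1 stays close to the raw agreement.

```python
def kappa_and_ac1(a: int, b: int, c: int, d: int) -> tuple[float, float]:
    """a, d = concordant cells; b, c = discordant cells of a 2x2 table."""
    n = a + b + c + d
    po = (a + d) / n
    p1, p2 = (a + b) / n, (a + c) / n            # each rater's rate of "disorder present"
    pe_kappa = p1 * p2 + (1 - p1) * (1 - p2)     # Cohen's chance term
    pi = (p1 + p2) / 2
    pe_ac1 = 2 * pi * (1 - pi)                   # Gwet's chance term for two categories
    return (po - pe_kappa) / (1 - pe_kappa), (po - pe_ac1) / (1 - pe_ac1)

balanced = kappa_and_ac1(a=30, b=5, c=5, d=30)   # ~50% prevalence
skewed = kappa_and_ac1(a=60, b=4, c=4, d=2)      # ~90% prevalence, similar raw agreement
print([round(x, 2) for x in balanced])           # kappa and AC1 nearly identical
print([round(x, 2) for x in skewed])             # kappa drops, AC1 stays high
```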
Challenges in surgical video annotation
by Ban, Yutong; Fer, Danyal M.; Meireles, Ozanan R.
in Annotation; Data science; Image classification
2021
Annotation of surgical video is important for establishing ground truth in surgical data science endeavors that involve computer vision. With the growth of the field over the last decade, several challenges have been identified in annotating spatial, temporal, and clinical elements of surgical video as well as challenges in selecting annotators. In reviewing current challenges, we provide suggestions on opportunities for improvement and possible next steps to enable translation of surgical data science efforts in surgical video analysis to clinical research and practice.
Journal Article
Better to be in agreement than in bad company
by Siqueira, Jose Oliveira; Silveira, Paulo Sergio Panse
in Behavioral Science and Psychology; Cognitive Psychology; Psychology
2023
We assessed several agreement coefficients applied to 2x2 contingency tables, which are common in research because of dichotomization. Here, we not only studied some specific estimators but also developed a general method for studying any candidate estimator of agreement. The method was implemented in open-source R code and is available to researchers. We tested it by verifying the performance of several traditional estimators over all possible table configurations with sizes ranging from 1 to 68 (a total of 1,028,789 tables). Cohen's kappa showed handicapped behavior similar to Pearson's r, Yule's Q, and Yule's Y. Scott's pi and Shankar and Bangdiwala's B seem to assess situations of disagreement better than agreement between raters. Krippendorff's alpha emulates, without any advantage, Scott's pi in cases with nominal variables and two raters. Dice's F1 and McNemar's chi-squared incompletely assess the information in the contingency table, showing the poorest performance of all. We conclude that Cohen's kappa is a measure of association and that McNemar's chi-squared assesses neither association nor agreement; the only two authentic agreement estimators are Holley and Guilford's G and Gwet's AC1. These two estimators also showed the best performance over the range of table sizes and should be considered the first choices for agreement measurement in 2x2 contingency tables. All procedures and data were implemented in R and are available for download from Harvard Dataverse: https://doi.org/10.7910/DVN/HMYTCK.
Journal Article
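As an illustrative aside, the snippet below adds Holley and Guilford's G, one of the two estimators the article above endorses (the other, Gwet's AC1, is sketched after the Cohen's Kappa/AC1 entry earlier in these results). For a 2x2 table, G is simply agreements minus disagreements over the total, i.e. 2·Po − 1, and does not depend on the marginals; the counts are hypothetical.

```python
def g_index(a: int, b: int, c: int, d: int) -> float:
    """Holley & Guilford's G for a 2x2 table: a, d concordant cells; b, c discordant."""
    n = a + b + c + d
    return (a + d - b - c) / n          # equivalently 2 * (a + d) / n - 1

print(round(g_index(a=60, b=4, c=4, d=2), 2))   # 0.77 for the skewed table used earlier
```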