55,791 results for "Correlation coefficients"
The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation
Background To evaluate binary classifications and their confusion matrices, scientific researchers can employ several statistical rates, according to the goal of the experiment they are investigating. Despite being a crucial issue in machine learning, no widespread consensus has yet been reached on a single preferred measure. Accuracy and F1 score computed on confusion matrices have been (and still are) among the most popular metrics in binary classification tasks. However, these statistical measures can dangerously show overoptimistic, inflated results, especially on imbalanced datasets. Results The Matthews correlation coefficient (MCC), instead, is a more reliable statistical rate, which produces a high score only if the prediction obtained good results in all four confusion matrix categories (true positives, false negatives, true negatives, and false positives), proportionally both to the size of positive elements and the size of negative elements in the dataset. Conclusions In this article, we show how MCC produces a more informative and truthful score in evaluating binary classifications than accuracy and F1 score, first by explaining its mathematical properties, and then by demonstrating its advantages in six synthetic use cases and in a real genomics scenario. We believe that the Matthews correlation coefficient should be preferred to accuracy and F1 score in evaluating binary classification tasks by all scientific communities.
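The failure mode this abstract describes is easy to reproduce numerically. Below is a minimal sketch (the function name and counts are invented for illustration, not taken from the paper) that computes accuracy, F1, and MCC from raw confusion-matrix counts; on an imbalanced toy set where a degenerate classifier predicts only the majority class, accuracy and F1 look excellent while MCC falls to zero:

```python
import math

def binary_metrics(tp, fp, fn, tn):
    """Accuracy, F1, and MCC from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # MCC is conventionally set to 0 when a marginal of the matrix is empty
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / den if den else 0.0
    return accuracy, f1, mcc

# 95 positives, 5 negatives; a degenerate classifier labels everything
# positive. Accuracy and F1 look excellent, but MCC is 0 -- the
# classifier is no better than a constant predictor.
acc, f1, mcc = binary_metrics(tp=95, fp=5, fn=0, tn=0)
print(acc, round(f1, 3), mcc)  # 0.95 0.974 0.0
```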
Improving the reliability of measurements in orthopaedics and sports medicine
A large space still exists for improving the measurements used in orthopaedics and sports medicine, especially as we face rapid technological progress in devices used for diagnostic or patient monitoring purposes. For a specific measure to be valuable and applicable in clinical practice, its reliability must be established. Reliability refers to the extent to which measurements can be replicated, and three types of reliability can be distinguished: inter-rater, intra-rater, and test–retest. The present article aims to provide insights into reliability as one of the most important and relevant properties of measurement tools. It covers essential knowledge about the methods used in orthopaedics and sports medicine for reliability studies. From design to interpretation, this article guides readers through the reliability study process. It addresses crucial issues such as the number of raters needed, sample size calculation, and breaks between particular trials. Different statistical methods and tests are presented for determining reliability depending on the type of gathered data, with particular attention to the commonly used intraclass correlation coefficient.
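To make the commonly used statistic concrete, here is a minimal sketch of one widely used form of the intraclass correlation coefficient, ICC(2,1) (two-way random effects, absolute agreement, single rater, following the Shrout–Fleiss convention); the function name and the example ratings are invented for illustration:

```python
def icc2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.
    `scores` is a list of rows, one row per subject, one column per rater."""
    n, k = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n * k)
    row_means = [sum(r) / k for r in scores]
    col_means = [sum(scores[i][j] for i in range(n)) / n for j in range(k)]
    # Two-way ANOVA decomposition of the total sum of squares
    ss_total = sum((x - grand) ** 2 for r in scores for x in r)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # subjects
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # raters
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

# Two raters scoring four athletes (made-up numbers):
ratings = [[7, 9], [5, 6], [8, 8], [4, 5]]
print(round(icc2_1(ratings), 3))  # 0.8
```

An ICC near 1 indicates that measurements can be replicated across raters; values for the other Shrout–Fleiss ICC forms differ only in which mean squares enter the denominator.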
The Matthews correlation coefficient (MCC) is more reliable than balanced accuracy, bookmaker informedness, and markedness in two-class confusion matrix evaluation
Evaluating binary classifications is a pivotal task in statistics and machine learning, because it can influence decisions in multiple areas, including for example prognosis or therapies of patients in critical conditions. The scientific community has not yet agreed on a general-purpose statistical indicator for evaluating two-class confusion matrices (having true positives, true negatives, false positives, and false negatives), even though the advantages of the Matthews correlation coefficient (MCC) over accuracy and F1 score have already been shown. In this manuscript, we reaffirm that MCC is a robust metric that summarizes the classifier performance in a single value, if positive and negative cases are of equal importance. We compare MCC to other metrics which value positive and negative cases equally: balanced accuracy (BA), bookmaker informedness (BM), and markedness (MK). We explain the mathematical relationships between MCC and these indicators, then show some use cases and a bioinformatics scenario where these metrics disagree and where MCC generates a more informative response. Additionally, we describe three exceptions where BM can be more appropriate: analyzing classifications where dataset prevalence is unrepresentative, comparing classifiers on different datasets, and assessing the random guessing level of a classifier. Except in these cases, we believe that MCC is the most informative among the single metrics discussed, and suggest it as a standard measure for scientists of all fields. A Matthews correlation coefficient close to +1, in fact, means having high values for all the other confusion matrix metrics. The same cannot be said for balanced accuracy, markedness, bookmaker informedness, accuracy, and F1 score.
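All four indicators compared in this abstract come from the same four confusion-matrix counts, so the comparison is easy to reproduce. A hedged sketch (function name and counts are illustrative); it also shows the known identity that |MCC| is the geometric mean of bookmaker informedness and markedness:

```python
import math

def rate_metrics(tp, fn, fp, tn):
    """Balanced accuracy (BA), bookmaker informedness (BM),
    markedness (MK), and MCC from confusion-matrix counts."""
    tpr = tp / (tp + fn)          # sensitivity / recall
    tnr = tn / (tn + fp)          # specificity
    ppv = tp / (tp + fp)          # precision
    npv = tn / (tn + fn)          # negative predictive value
    ba = (tpr + tnr) / 2
    bm = tpr + tnr - 1            # a.k.a. Youden's J
    mk = ppv + npv - 1
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return ba, bm, mk, mcc

# A sensitive but imprecise classifier on an imbalanced toy set.
# BA and BM look moderate, but MK (and hence MCC) is pulled down
# by the poor precision; note that |MCC| == sqrt(BM * MK).
ba, bm, mk, mcc = rate_metrics(tp=9, fn=1, fp=40, tn=50)
print(round(ba, 3), round(bm, 3), round(mk, 3), round(mcc, 3))
```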
The Matthews correlation coefficient (MCC) should replace the ROC AUC as the standard metric for assessing binary classification
Binary classification is a common task for which machine learning and computational statistics are used, and the area under the receiver operating characteristic curve (ROC AUC) has become the standard metric to evaluate binary classifications in most scientific fields. The ROC curve has the true positive rate (also called sensitivity or recall) on the y axis and the false positive rate on the x axis, and the ROC AUC can range from 0 (worst result) to 1 (perfect result). The ROC AUC, however, has several flaws and drawbacks. This score is generated including predictions that obtained insufficient sensitivity and specificity, and moreover it says nothing about the positive predictive value (also known as precision) or the negative predictive value (NPV) obtained by the classifier, therefore potentially generating inflated, overoptimistic results. Since it is common to report the ROC AUC alone, without precision and negative predictive value, a researcher might erroneously conclude that their classification was successful. Furthermore, a given point in ROC space does not identify a single confusion matrix nor a group of matrices sharing the same MCC value. Indeed, a given (sensitivity, specificity) pair can cover a broad MCC range, which casts doubt on the reliability of the ROC AUC as a performance measure. In contrast, the Matthews correlation coefficient (MCC) generates a high score in its [-1, +1] interval only if the classifier scored a high value for all four basic rates of the confusion matrix: sensitivity, specificity, precision, and negative predictive value. A high MCC (for example, MCC = 0.9), moreover, always corresponds to a high ROC AUC, but not vice versa. In this short study, we explain why the Matthews correlation coefficient should replace the ROC AUC as the standard statistic in all scientific studies involving a binary classification, in all scientific fields.
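The claim that one (sensitivity, specificity) pair can cover a broad MCC range can be checked directly: holding a single ROC-space point fixed and varying class prevalence changes the implied MCC substantially. A minimal sketch (function name and numbers are illustrative, not from the paper):

```python
import math

def mcc_from_rates(sens, spec, prevalence, n=10_000):
    """MCC implied by a single ROC-space point (sens, spec) at a
    given prevalence of the positive class."""
    tp = sens * prevalence * n
    fn = (1 - sens) * prevalence * n
    tn = spec * (1 - prevalence) * n
    fp = (1 - spec) * (1 - prevalence) * n
    return (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

# The same ROC point (sens = 0.9, spec = 0.9) yields very different
# MCC values as prevalence shifts:
for p in (0.5, 0.1, 0.01):
    print(p, round(mcc_from_rates(0.9, 0.9, p), 3))  # 0.8, 0.625, 0.256
```

At 1% prevalence the point that looked excellent in ROC space corresponds to an MCC of about 0.26, because precision collapses even though sensitivity and specificity are unchanged.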
Identification of mobile genetic elements with geNomad
Identifying and characterizing mobile genetic elements in sequencing data is essential for understanding their diversity, ecology, biotechnological applications and impact on public health. Here we introduce geNomad, a classification and annotation framework that combines information from gene content and a deep neural network to identify sequences of plasmids and viruses. geNomad uses a dataset of more than 200,000 marker protein profiles to provide functional gene annotation and taxonomic assignment of viral genomes. Using a conditional random field model, geNomad also detects proviruses integrated into host genomes with high precision. In benchmarks, geNomad achieved high classification performance for diverse plasmids and viruses (Matthews correlation coefficient of 77.8% and 95.3%, respectively), substantially outperforming other tools. Leveraging geNomad’s speed and scalability, we processed over 2.7 trillion base pairs of sequencing data, leading to the discovery of millions of viruses and plasmids that are available through the IMG/VR and IMG/PR databases. geNomad is available at https://portal.nersc.gov/genomad. geNomad identifies mobile genetic elements in sequencing data.
Trastuzumab deruxtecan in metastatic breast cancer with variable HER2 expression: the phase 2 DAISY trial
The mechanisms of action of and resistance to trastuzumab deruxtecan (T-DXd), an anti-HER2 antibody–drug conjugate for breast cancer treatment, remain unclear. The phase 2 DAISY trial evaluated the efficacy of T-DXd in patients with HER2-overexpressing (n = 72, cohort 1), HER2-low (n = 74, cohort 2) and HER2 non-expressing (n = 40, cohort 3) metastatic breast cancer. In the full analysis set population (n = 177), the confirmed objective response rate (primary endpoint) was 70.6% (95% confidence interval (CI) 58.3–81) in cohort 1, 37.5% (95% CI 26.4–49.7) in cohort 2 and 29.7% (95% CI 15.9–47) in cohort 3. The primary endpoint was met in cohorts 1 and 2. Secondary endpoints included safety. No new safety signals were observed. During treatment, HER2-expressing tumors (n = 4) presented strong T-DXd staining. Conversely, HER2 immunohistochemistry 0 samples (n = 3) presented little or no T-DXd staining (Pearson correlation coefficient r = 0.75, P = 0.053). Among patients with HER2 immunohistochemistry 0 metastatic breast cancer, 5 of 14 (35.7%, 95% CI 12.8–64.9) with ERBB2 expression below the median presented a confirmed objective response, as compared to 3 of 10 (30%, 95% CI 6.7–65.2) with ERBB2 expression above the median. Although HER2 expression is a determinant of T-DXd efficacy, our study suggests that additional mechanisms may also be involved. (ClinicalTrials.gov identifier NCT04132960.) Trastuzumab deruxtecan, an anti-HER2 antibody–drug conjugate, exhibits the highest objective response rate in patients with HER2-overexpressing metastatic breast cancer, but clinical activity is also observed in patients with HER2-low or non-expressing tumors, potentially pointing to additional determinants of drug efficacy.
Why Cohen’s Kappa should be avoided as performance measure in classification
We show that Cohen's Kappa and the Matthews Correlation Coefficient (MCC), two widely used measures of performance in multi-class classification, are correlated in most situations, although they can differ in others. Indeed, although in the symmetric case the two match, we consider different unbalanced situations in which Kappa exhibits an undesired behaviour, i.e. a worse classifier gets a higher Kappa score, differing qualitatively from that of MCC. The debate about the incoherence in the behaviour of Kappa revolves around the convenience, or not, of using a relative metric, which makes the interpretation of its values difficult. We extend these concerns by showing that its pitfalls can go even further. Through experimentation, we present a novel approach to this topic. We carry out a comprehensive study that identifies a scenario in which the contradictory behaviour between MCC and Kappa emerges. Specifically, we find that when the entropy of the off-diagonal elements of the confusion matrix associated with a classifier decreases to zero, the discrepancy between Kappa and MCC rises, pointing to an anomalous performance of the former. We believe that this finding disqualifies Kappa from being used in general as a performance measure to compare classifiers.
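Both statistics can be computed from the same multi-class confusion matrix, which makes the comparison straightforward to reproduce. A minimal sketch on toy matrices (invented function name; the multi-class MCC follows Gorodkin's R_K formulation, and the paper's entropy-driven anomaly is not reproduced here):

```python
import math

def kappa_and_mcc(cm):
    """Cohen's kappa and multi-class MCC (Gorodkin's R_K) from a
    confusion matrix given as rows = true class, columns = predicted."""
    k = len(cm)
    s = sum(sum(row) for row in cm)                  # total samples
    c = sum(cm[i][i] for i in range(k))              # correct predictions
    t = [sum(row) for row in cm]                     # true counts per class
    p = [sum(cm[i][j] for i in range(k)) for j in range(k)]  # predicted counts
    cross = sum(t[i] * p[i] for i in range(k))
    kappa = (c * s - cross) / (s * s - cross)
    mcc = (c * s - cross) / math.sqrt(
        (s * s - sum(x * x for x in p)) * (s * s - sum(x * x for x in t)))
    return kappa, mcc

# In the symmetric case the two coincide:
print(kappa_and_mcc([[45, 5], [5, 45]]))  # kappa == MCC == 0.8
# With asymmetric errors they start to diverge:
k, m = kappa_and_mcc([[40, 10, 0], [10, 40, 0], [0, 10, 40]])
print(round(k, 3), round(m, 3))  # 0.7 0.705
```

Both statistics share the numerator c·s − Σ t_k·p_k; only the normalization differs, which is where the divergence the abstract analyses comes from.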
1 km monthly temperature and precipitation dataset for China from 1901 to 2017
High-spatial-resolution and long-term climate data are highly desirable for understanding climate-related natural processes. China covers a large area with a low density of weather stations in some (e.g., mountainous) regions. This study describes a 0.5′ (∼ 1 km) dataset of monthly air temperatures at 2 m (minimum, maximum, and mean proxy monthly temperatures, TMPs) and precipitation (PRE) for China in the period of 1901–2017. The dataset was spatially downscaled from the 30′ Climatic Research Unit (CRU) time series dataset with the climatology dataset of WorldClim using delta spatial downscaling and evaluated using observations collected in 1951–2016 by 496 weather stations across China. Prior to downscaling, we evaluated the performances of the WorldClim data with different spatial resolutions and the 30′ original CRU dataset using the observations, revealing that their qualities were overall satisfactory. Specifically, WorldClim data exhibited better performance at higher spatial resolution, while the 30′ original CRU dataset had low biases and high performances. Bicubic, bilinear, and nearest-neighbor interpolation methods employed in downscaling processes were compared, and bilinear interpolation was found to exhibit the best performance to generate the downscaled dataset. Compared with the evaluations of the 30′ original CRU dataset, the mean absolute error of the new dataset (i.e., of the 0.5′ dataset downscaled by bilinear interpolation) decreased by 35.4 %–48.7 % for TMPs and by 25.7 % for PRE. The root-mean-square error decreased by 32.4 %–44.9 % for TMPs and by 25.8 % for PRE. The Nash–Sutcliffe efficiency coefficients increased by 9.6 %–13.8 % for TMPs and by 31.6 % for PRE, and correlation coefficients increased by 0.2 %–0.4 % for TMPs and by 5.0 % for PRE. The new dataset could provide detailed climatology data and annual trends of all climatic variables across China, and the results could be evaluated well using observations at the stations.
Although the new dataset could not be evaluated before 1950 owing to data unavailability, its quality over the 1901–2017 period depends on the quality of the original CRU and WorldClim datasets. Because the downscaling procedure further improved the quality and spatial resolution of the CRU dataset, the new dataset is considered reliable and useful for investigations related to climate change across China. The dataset presented in this article has been published in the Network Common Data Form (NetCDF) at https://doi.org/10.5281/zenodo.3114194 for precipitation (Peng, 2019a) and https://doi.org/10.5281/zenodo.3185722 for air temperatures at 2 m (Peng, 2019b) and includes 156 NetCDF files compressed in zip format and one user guidance text file.
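The evaluation statistics quoted above (MAE, RMSE, the Nash–Sutcliffe efficiency, and the correlation coefficient) can all be computed from paired observed and downscaled series. A minimal sketch with invented toy numbers (not the paper's data or code):

```python
import math

def evaluate(obs, sim):
    """MAE, RMSE, Nash-Sutcliffe efficiency (NSE), and Pearson's r
    between observed and simulated (e.g. downscaled) series."""
    n = len(obs)
    sq_err = sum((o - s) ** 2 for o, s in zip(obs, sim))
    mae = sum(abs(o - s) for o, s in zip(obs, sim)) / n
    rmse = math.sqrt(sq_err / n)
    mo = sum(obs) / n
    ms = sum(sim) / n
    # NSE = 1 means a perfect match; NSE <= 0 means the simulation is
    # no better than predicting the observed mean everywhere.
    nse = 1 - sq_err / sum((o - mo) ** 2 for o in obs)
    cov = sum((o - mo) * (s - ms) for o, s in zip(obs, sim))
    r = cov / math.sqrt(sum((o - mo) ** 2 for o in obs) *
                        sum((s - ms) ** 2 for s in sim))
    return mae, rmse, nse, r

# Toy monthly series (invented values):
obs = [2.0, 4.0, 6.0, 8.0]
sim = [2.5, 3.5, 6.5, 7.5]
mae, rmse, nse, r = evaluate(obs, sim)
print(mae, rmse, round(nse, 3), round(r, 3))  # 0.5 0.5 0.95 0.976
```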
The WHO Bacterial Priority Pathogens List 2024: a prioritisation study to guide research, development, and public health strategies against antimicrobial resistance
The 2017 WHO Bacterial Priority Pathogens List (BPPL) has been instrumental in guiding global policy, research and development, and investments to address the most urgent threats from antibiotic-resistant pathogens, and it is a key public health tool for the prevention and control of antimicrobial resistance (AMR). Since its release, at least 13 new antibiotics targeting bacterial priority pathogens have been approved. The 2024 WHO BPPL aims to refine and build on the previous list by incorporating new data and evidence, addressing previous limitations, and improving pathogen prioritisation to better guide global efforts in combating AMR. The 2024 WHO BPPL followed a similar approach to the first prioritisation exercise, using a multicriteria decision analysis framework. 24 antibiotic-resistant bacterial pathogens were scored based on eight criteria, including mortality, non-fatal burden, incidence, 10-year resistance trends, preventability, transmissibility, treatability, and antibacterial pipeline status. Pathogens were assessed on each of the criteria on the basis of available evidence and expert judgement. A preferences survey using a pairwise comparison was administered to 100 international experts (among whom 79 responded and 78 completed the survey) to determine the relative weights of the criteria. Applying these weights, the final ranking of pathogens was determined by calculating a total score in the range of 0–100% for each pathogen. Subgroup and sensitivity analyses were conducted to assess the impact of experts’ consistency, background, and geographical origin on the stability of the rankings. An independent advisory group reviewed the final list, and pathogens were subsequently streamlined and grouped into three priority tiers based on a quartile scoring system: critical (highest quartile), high (middle quartiles), and medium (lowest quartile). 
The pathogens’ total scores ranged from 84% for the top-ranked bacterium (carbapenem-resistant Klebsiella pneumoniae) to 28% for the bottom-ranked bacterium (penicillin-resistant group B streptococci). Antibiotic-resistant Gram-negative bacteria (including K pneumoniae, Acinetobacter spp, and Escherichia coli), as well as rifampicin-resistant Mycobacterium tuberculosis, were ranked in the highest quartile. Among the bacteria commonly responsible for community-acquired infections, the highest rankings were for fluoroquinolone-resistant Salmonella enterica serotype Typhi (72%), Shigella spp (70%), and Neisseria gonorrhoeae (64%). Other important pathogens on the list include Pseudomonas aeruginosa and Staphylococcus aureus. The results of the preferences survey showed a strong inter-rater agreement, with Spearman's rank correlation coefficient and Kendall's coefficient of concordance both at 0·9. The final ranking showed high stability, with clustering of the pathogens based on experts’ backgrounds and origins not resulting in any substantial changes to the ranking. The 2024 WHO BPPL is a key tool for prioritising research and development investments and informing global public health policies to combat AMR. Gram-negative bacteria and rifampicin-resistant M tuberculosis remain critical priority pathogens, underscoring their persistent threat and the limitations of the current antibacterial pipeline. Focused efforts and sustained investments in novel antibacterials are needed to address AMR priority pathogens, which include high-burden antibiotic-resistant bacteria such as Salmonella and Shigella spp, N gonorrhoeae, and S aureus. Beyond research and development, efforts to address these pathogens should also include expanding equitable access to existing drugs, enhancing vaccine coverage, and strengthening infection prevention and control measures. 
This work is based on the development of the 2024 WHO BPPL, which was conducted by the WHO AMR Division through grants from the Government of Austria, the Government of Germany, the Government of Saudi Arabia, and the European Commission's Health Emergency Preparedness and Response Authority. For the Arabic, French, Italian, Japanese and Spanish translations of the abstract see Supplementary Materials section.
Challenges in the real world use of classification accuracy metrics: From recall and precision to the Matthews correlation coefficient
The accuracy of a classification is fundamental to its interpretation, use, and ultimately decision making. Unfortunately, the apparent accuracy assessed can differ greatly from the true accuracy. Mis-estimation of classification accuracy metrics and the associated mis-interpretations are often due to variations in prevalence and the use of an imperfect reference standard. The fundamental issues underlying the problems associated with variations in prevalence and reference standard quality are revisited here for binary classifications, with particular attention focused on the use of the Matthews correlation coefficient (MCC). A key attribute claimed of the MCC is that a high value can only be attained when the classification performed well on both classes in a binary classification. However, it is shown here that the apparent magnitudes of a set of popular accuracy metrics used in fields such as computer science, medicine, and environmental science (recall, precision, specificity, negative predictive value, J, F1, likelihood ratios, and MCC) and of one key attribute (prevalence) were all greatly influenced by variations in prevalence and the use of an imperfect reference standard. Simulations using realistic values for data quality in applications such as remote sensing showed that each metric varied over the range of possible prevalence and at differing levels of reference standard quality. The direction and magnitude of accuracy metric mis-estimation were a function of prevalence and of the size and nature of the imperfections in the reference standard. It was evident that the apparent MCC could be substantially under- or over-estimated. Additionally, a high apparent MCC could arise from an unquestionably poor classification. As with some other metrics of accuracy, the utility of the MCC may be overstated, and apparent values need to be interpreted with caution. Apparent accuracy and prevalence values can be misleading, and calls for the issues to be recognised and addressed should be heeded.
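The central point, that an imperfect reference standard distorts apparent metrics, can be illustrated with expected confusion counts rather than full simulation. The sketch below is illustrative only: the function names are invented, and it assumes reference errors independent of classifier errors, which is just one of the scenarios the paper considers. It compares the MCC judged against a gold standard with the apparent MCC judged against a reference of 95% sensitivity and specificity:

```python
import math

def mcc(tp, fn, fp, tn):
    """MCC from the four cells of a (possibly fractional) confusion matrix."""
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / den if den else 0.0

def apparent_counts(prev, sens_c, spec_c, sens_r, spec_r, n=100_000):
    """Expected confusion counts when a classifier (sens_c, spec_c) is
    judged against a reference standard (sens_r, spec_r) whose errors
    are assumed independent of the classifier's errors."""
    pos, neg = prev * n, (1 - prev) * n
    # joint expected counts of (classifier label, reference label)
    tp = pos * sens_c * sens_r + neg * (1 - spec_c) * (1 - spec_r)
    fn = pos * (1 - sens_c) * sens_r + neg * spec_c * (1 - spec_r)
    fp = pos * sens_c * (1 - sens_r) + neg * (1 - spec_c) * spec_r
    tn = pos * (1 - sens_c) * (1 - sens_r) + neg * spec_c * spec_r
    return tp, fn, fp, tn

true_mcc = mcc(*apparent_counts(0.5, 0.9, 0.9, 1.0, 1.0))   # perfect reference
appar = mcc(*apparent_counts(0.5, 0.9, 0.9, 0.95, 0.95))    # imperfect reference
print(round(true_mcc, 3), round(appar, 3))  # 0.8 0.72
```

Even a mildly imperfect reference shifts the apparent MCC away from its true value; varying `prev` in the same sketch reproduces the prevalence dependence the abstract describes.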