Catalogue Search | MBRL

Machine learning for discovering missing or wrong protein function annotations

by Vens, Celine; Nakano, Felipe Kenji; Lietaert, Mathias in Algorithms, Benchmark datasets, Benchmarking

2019

A massive amount of proteomic data is generated on a daily basis; nonetheless, annotating all sequences is costly and often unfeasible. As a countermeasure, machine learning methods have been used to automatically annotate new protein functions. More specifically, many studies have investigated hierarchical multi-label classification (HMC) methods to predict annotations, using the Functional Catalogue (FunCat) or Gene Ontology (GO) label hierarchies. Most of these studies employed benchmark datasets created more than a decade ago, and thus trained their models on outdated information. In this work, we provide an updated version of these datasets. By querying recent versions of FunCat and GO yeast annotations, we provide 24 new datasets in total. We compare four HMC methods, providing baseline results for the new datasets. Furthermore, we evaluate whether the predictive models are able to discover new or wrong annotations, by training them on the old data and evaluating their results against the most recent information. The results demonstrated that the method based on predictive clustering trees, Clus-Ensemble, proposed in 2008, achieved superior results compared to more recent methods on the standard evaluation task. For the discovery of new knowledge, Clus-Ensemble performed better when discovering new annotations in the FunCat taxonomy, whereas hierarchical multi-label classification with genetic algorithm (HMC-GA), a method based on genetic algorithms, was overall superior when detecting annotations that were removed. In the GO datasets, Clus-Ensemble once again had the upper hand when discovering new annotations, whereas HMC-GA performed better for detecting removed annotations. However, in this evaluation, there were fewer significant differences among the methods. The experiments have shown that protein function prediction is a very challenging task which should be further investigated. We believe that the baseline results associated with the updated datasets provided in this work should be considered as guidelines for future studies; nonetheless, the old versions of the datasets should not be disregarded, since other tasks in machine learning could benefit from them.

Journal Article

Comparison between the EKFC-equation and machine learning models to predict Glomerular Filtration Rate

by Courbebaisse, Marie; Ebert, Natalie; Åkesson, Anna in 639/705/1046, 639/705/117, 639/705/794

2024

In clinical practice, the glomerular filtration rate (GFR), a measure of kidney function, is normally calculated using equations such as the European Kidney Function Consortium (EKFC) equation. Despite being the most general equation, EKFC, just like previously proposed approaches, can still struggle to achieve satisfactory performance, limiting its clinical applicability. As a possible solution, machine learning (ML) has recently been investigated to improve GFR prediction; nonetheless, the literature still lacks a general, multi-center study. Using a dataset with 19,629 patients from 13 cohorts, we investigate whether ML can improve GFR prediction in comparison to EKFC. More specifically, we compare diverse ML methods, which were allowed to use age, sex, serum creatinine, cystatin C, height, weight and BMI as features, against EKFC in internal and external cohorts. The results show that the best-performing ML method, random forest (RF), and EKFC are very competitive: considering the entire cohort, RF achieved P10 and P30 values of 0.45 (95% CI 0.44;0.46) and 0.89 (95% CI 0.88;0.90), respectively, whereas EKFC yielded 0.44 (95% CI 0.43;0.44) and 0.89 (95% CI 0.88;0.90). Small differences were, however, observed in patients younger than 12 years, where RF slightly outperformed EKFC.

Journal Article

Correction: Development and validation of a machine learning model for early prediction of intensive care unit acquired weakness

by Van den Berghe, Greet; Coppens, Grégoire; Nakano, Felipe Kenji in Correction, Critical Care Medicine, Intensive

2025

Journal Article

Development and validation of a machine learning model for early prediction of intensive care unit acquired weakness

by Van den Berghe, Greet; Grandas, Fabian Güiza; Coppens, Grégoire in C-reactive protein, Calibration, Clinical outcomes

2025

Background: Early identification of potential high-cost, high-need patients in the ICU may assist in the development of targeted protocols, allowing proper resource utilization and initiation of preventive care. Weakness acquired in the ICU that develops within the first week is an independent predictor of both short- and long-term adverse outcomes; nonetheless, early prediction is challenging. We aimed to develop and validate a machine learning model for ICU-acquired weakness (ICU-AW), using data readily available within the first 24 h of ICU admission. Methods: We used patients from the EPaNIC trial (NCT00512122, N = 4640) who were assessed for muscle weakness at day 9 (IQR 8–13) after ICU admission using the Medical Research Council (MRC) sum. Patients were diagnosed with ICU-AW if their MRC sum was lower than 48. The final subset contained N = 600 patients. Our models were internally validated using 100 repetitions of fivefold cross-validation. We compared three predictive models: (i) a random forest and (ii) a logistic regression model, both built using descriptors available at day 1, and (iii) a random forest using only APACHE II as a descriptor. Both random forests contained 150 trees. Results: The training set comprised 600 patients, in whom the incidence of ICU-AW was 38.6% (232/600). The AUROCs of the random forest with all descriptors and of the logistic regression were 76% and 74%, respectively. The random forest (RF) achieved a specificity of 62% and a sensitivity of 79%, whereas the logistic regression yielded 69% and 68%, respectively. The RF identified APACHE II, creatinine, SOFA PaO2/FiO2, bilirubin, BMI, age, glycemia upon admission, morning glycemia and sepsis as the most relevant descriptors. Lastly, the RF also presented very good calibration and clinical usefulness for a wide range of risk thresholds. Conclusions: Machine learning models, especially random forests, can be used to predict whether patients are at risk of developing ICU-AW, using data available within 24 h of admission. This tool allows prognostication early in an adult general critically ill patient population, with the potential to detect high-cost, high-need patients who benefit from different levels of care.
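
The internal validation scheme named in the abstract (100 repetitions of fivefold cross-validation) can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `repeated_kfold`, the seed, and the unstratified shuffling are assumptions, since the abstract does not say how the folds were drawn.

```python
import random


def repeated_kfold(n_samples, n_folds=5, n_repeats=100, seed=0):
    """Yield (train_indices, test_indices) pairs for repeated k-fold CV.

    Assumption: plain (unstratified) shuffling; the abstract does not
    state whether the folds were stratified by outcome.
    """
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh random partition for every repetition
        for fold in range(n_folds):
            test_idx = indices[fold::n_folds]  # every n_folds-th shuffled index
            held_out = set(test_idx)
            train_idx = [i for i in indices if i not in held_out]
            yield train_idx, test_idx


# 600 patients, as in the abstract: 100 repetitions x 5 folds.
splits = list(repeated_kfold(600))
```

Each of the resulting splits holds out one fifth of the patients for evaluation; a model would be refit on `train_idx` and scored on `test_idx` for every split, and the scores averaged.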

Journal Article

Enhancing individual glomerular filtration rate assessment: can we trust the equation? Development and validation of machine learning models to assess the trustworthiness of estimated GFR compared to measured GFR

by Courbebaisse, Marie; Ebert, Natalie; Bökenkamp, Arend in Adult, Aged, Algorithms

2025

Background: Creatinine-based estimated glomerular filtration rate (eGFR) equations are widely used in clinical practice but exhibit inherent limitations. On the other hand, measuring GFR is time-consuming and not available in routine clinical practice. We developed and validated machine learning models to assess the trustworthiness (i.e. the ability of equations to estimate measured GFR (mGFR) within 10%, 20% or 30%) of the European Kidney Function Consortium (EKFC) equation at the individual level. Methods: This observational study used data from European and US cohorts, comprising 22,343 participants of all ages with available mGFR results. Four machine learning and two traditional logistic regression models were trained on a cohort of 9,202 participants to predict the likelihood of the EKFC creatinine-derived eGFR falling within 30% (P30), 20% (P20) or 10% (P10) of the mGFR value. The algorithms were internally and then externally validated on cohorts of 3,034 and 10,107 participants, respectively. The predictors included in the models were creatinine, age, sex, height, weight, and EKFC. Results: The random forest model was the most robust model. In the external validation cohort, the model achieved an area under the curve of 0.675 (95% CI 0.660;0.690) and an accuracy of 0.716 (95% CI 0.707;0.725) for the P30 criterion. Sensitivity was 0.756 (95% CI 0.747;0.765) and specificity was 0.485 (95% CI 0.460;0.511) at the 80% probability level that EKFC falls within 30% of mGFR. At the population level, the PPV of this machine learning model was 89.5%, higher than the EKFC P30 of 85.2%. A free web application was developed to allow the physician to assess the trustworthiness of EKFC at the individual level. Conclusions: A strategy using a machine learning model marginally improves the trustworthiness of GFR estimation at the population level. An additional value of this approach lies in its ability to provide assessments at the individual level.
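
The P10, P20 and P30 criteria described above (the share of estimates falling within 10%, 20% or 30% of the measured GFR) can be sketched as below. The function name `p_within` and the GFR values are hypothetical and serve only as an illustration; they are not data from the study.

```python
def p_within(estimated, measured, tolerance):
    """Fraction of estimates lying within +/- tolerance of the measured value.

    tolerance is relative, e.g. 0.30 for the P30 criterion.
    """
    if len(estimated) != len(measured) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(
        1 for e, m in zip(estimated, measured) if abs(e - m) <= tolerance * m
    )
    return hits / len(measured)


# Hypothetical eGFR / mGFR values (mL/min/1.73 m^2), for illustration only.
egfr = [95.0, 60.0, 42.0, 110.0]
mgfr = [100.0, 80.0, 40.0, 75.0]

p10 = p_within(egfr, mgfr, 0.10)  # 2 of 4 estimates within 10%
p30 = p_within(egfr, mgfr, 0.30)  # 3 of 4 estimates within 30%
```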

Journal Article

Machine learning for discovering missing or wrong protein function annotations

by Vens, Celine; Nakano, Felipe Kenji; Lietaert, Mathias in Algorithms, Bioinformatics, Biomedical and Life Sciences

2019

Background: A massive amount of proteomic data is generated on a daily basis; nonetheless, annotating all sequences is costly and often unfeasible. As a countermeasure, machine learning methods have been used to automatically annotate new protein functions. More specifically, many studies have investigated hierarchical multi-label classification (HMC) methods to predict annotations, using the Functional Catalogue (FunCat) or Gene Ontology (GO) label hierarchies. Most of these studies employed benchmark datasets created more than a decade ago, and thus trained their models on outdated information. In this work, we provide an updated version of these datasets. By querying recent versions of FunCat and GO yeast annotations, we provide 24 new datasets in total. We compare four HMC methods, providing baseline results for the new datasets. Furthermore, we evaluate whether the predictive models are able to discover new or wrong annotations, by training them on the old data and evaluating their results against the most recent information. Results: The results demonstrated that the method based on predictive clustering trees, Clus-Ensemble, proposed in 2008, achieved superior results compared to more recent methods on the standard evaluation task. For the discovery of new knowledge, Clus-Ensemble performed better when discovering new annotations in the FunCat taxonomy, whereas hierarchical multi-label classification with genetic algorithm (HMC-GA), a method based on genetic algorithms, was overall superior when detecting annotations that were removed. In the GO datasets, Clus-Ensemble once again had the upper hand when discovering new annotations, whereas HMC-GA performed better for detecting removed annotations. However, in this evaluation, there were fewer significant differences among the methods. Conclusions: The experiments have shown that protein function prediction is a very challenging task which should be further investigated. We believe that the baseline results associated with the updated datasets provided in this work should be considered as guidelines for future studies; nonetheless, the old versions of the datasets should not be disregarded, since other tasks in machine learning could benefit from them.

Journal Article

Active learning for hierarchical multi-label classification

by Cerri, Ricardo; Vens, Celine; Nakano, Felipe Kenji in Active learning, Algorithms, Classification

2020

Due to technological advances, a massive amount of data is produced daily, presenting challenges for application areas where data needs to be labelled by a domain specialist or by expensive procedures in order to be useful for supervised machine learning. To select which data points will provide the most information when labelled, one can make use of active learning methods. Active learning (AL) is a subfield of machine learning which addresses methods to build models with fewer, but more representative, instances. Even though AL has been extensively studied, it has not been thoroughly investigated in hierarchical multi-label classification, a learning task where multiple class labels can be assigned to an instance and these labels are hierarchically structured. In this work, we provide a public framework containing baseline and state-of-the-art algorithms suitable for this task. Additionally, we propose a new algorithm, namely Hierarchical Query-By-Committee (H-QBC), which is validated on datasets from different domains. Our results show that H-QBC is capable of providing superior predictive performance compared to its competitors, while being computationally efficient and parameter-free.
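
As background for the query-by-committee idea that H-QBC builds on, the sketch below shows generic, flat query-by-committee with vote entropy as the disagreement measure. It is not the H-QBC algorithm itself, whose hierarchical details are not given in the abstract; the function names and the toy votes are assumptions for illustration.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Disagreement among committee members, measured as entropy of their votes."""
    total = len(votes)
    counts = Counter(votes)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def select_query(committee_votes):
    """Return the unlabelled instance the committee disagrees on most.

    committee_votes maps an instance id to the list of labels predicted
    for it by each committee member.
    """
    return max(committee_votes, key=lambda i: vote_entropy(committee_votes[i]))


# Toy example: three committee members vote on three unlabelled instances.
votes = {
    "x1": ["A", "A", "A"],  # full agreement
    "x2": ["A", "B", "A"],  # partial disagreement
    "x3": ["A", "B", "C"],  # maximal disagreement
}
chosen = select_query(votes)
```

The instance with the highest disagreement is the one sent to the annotator, on the assumption that its label is the most informative for the next round of training.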

Journal Article

Deep tree-ensembles for multi-output prediction

by Nakano, Felipe Kenji; Vens, Celine; Pliakos, Konstantinos in Artificial neural networks, Classification, Machine learning

2021

Recently, deep neural networks have advanced the state of the art in various scientific fields and provided solutions to long-standing problems across multiple application domains. Nevertheless, they also suffer from weaknesses, since their optimal performance depends on massive amounts of training data and the tuning of a large number of parameters. As a countermeasure, some deep-forest methods have recently been proposed as efficient and low-scale solutions. Despite that, these approaches simply employ label classification probabilities as induced features and primarily focus on traditional classification and regression tasks, leaving multi-output prediction under-explored. Moreover, recent work has demonstrated that tree-embeddings are highly representative, especially in structured output prediction. In this direction, we propose a novel deep tree-ensemble (DTE) model, where every layer enriches the original feature set with a representation learning component based on tree-embeddings. In this paper, we specifically focus on two structured output prediction tasks, namely multi-label classification and multi-target regression. We conducted experiments using multiple benchmark datasets, and the obtained results confirm that our method provides superior results to state-of-the-art methods in both tasks.

Paper

Oxytrees: Model Trees for Bipartite Learning

by Nakano, Felipe Kenji; Gharahighehi, Alireza; Ilídio, Pedro in Cognitive tasks, Datasets, Machine learning

2025

Bipartite learning is a machine learning task that aims to predict interactions between pairs of instances. It has been applied to various domains, including drug-target interactions, RNA-disease associations, and regulatory network inference. Despite being widely investigated, current methods still present drawbacks, as they are often designed for a specific application and thus do not generalize to other problems or present scalability issues. To address these challenges, we propose Oxytrees: proxy-based biclustering model trees. Oxytrees compress the interaction matrix into row- and column-wise proxy matrices, significantly reducing training time without compromising predictive performance. We also propose a new leaf-assignment algorithm that significantly reduces the time taken for prediction. Finally, Oxytrees employ linear models using the Kronecker product kernel in their leaves, resulting in shallower trees and thus even faster training. Using 15 datasets, we compared the predictive performance of ensembles of Oxytrees with that of the current state-of-the-art. We achieved up to 30-fold improvement in training times compared to state-of-the-art biclustering forests, while demonstrating competitive or superior performance in most evaluation settings, particularly in the inductive setting. Finally, we provide an intuitive Python API to access all datasets, methods and evaluation measures used in this work, thus enabling reproducible research in this field.
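
The Kronecker product kernel mentioned above scores the similarity of two row-column pairs as the product of a row kernel and a column kernel. The sketch below assumes linear base kernels and hypothetical feature vectors (the abstract does not specify either) and checks the result against an explicit Kronecker product of the features.

```python
def dot(u, v):
    """Linear kernel: plain inner product of two feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def kron(u, v):
    """Kronecker product of two vectors, flattened into one list."""
    return [a * b for a in u for b in v]


def kronecker_kernel(row_a, col_a, row_b, col_b):
    """Pairwise kernel k((r, c), (r', c')) = k_row(r, r') * k_col(c, c').

    Assumption: both base kernels are linear.
    """
    return dot(row_a, row_b) * dot(col_a, col_b)


# Hypothetical features for two (row, column) pairs, e.g. (drug, target).
drug_a, target_a = [1.0, 2.0], [0.5, 1.0, 0.0]
drug_b, target_b = [3.0, 1.0], [2.0, 0.0, 1.0]

implicit = kronecker_kernel(drug_a, target_a, drug_b, target_b)
explicit = dot(kron(drug_a, target_a), kron(drug_b, target_b))
```

Both values agree, which is why the pairwise kernel can be evaluated without ever materialising the much larger Kronecker feature vectors.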

Paper

Pairwise and Attribute-Aware Decision Tree-Based Preference Elicitation for Cold-Start Recommendation

by Nakano, Felipe Kenji; Gharahighehi, Alireza; Yang, Xuehua in Customization, Decision trees, Filtration

2025

Recommender systems (RSs) are intelligent filtering methods that suggest items to users based on their inferred preferences, derived from their interaction history on the platform. Collaborative filtering-based RSs rely on users' past interactions to generate recommendations. However, when a user is new to the platform, referred to as a cold-start user, there is no historical data available, making it difficult to provide personalized recommendations. To address this, rating elicitation techniques can be used to gather initial ratings or preferences on selected items, helping to build an early understanding of the user's tastes. Rating elicitation approaches are generally categorized into two types: non-personalized and personalized. Decision tree-based rating elicitation is a personalized method that queries users about their preferences at each node of the tree until sufficient information is gathered. In this paper, we propose an extension to the decision tree approach for rating elicitation in the context of music recommendation. Our method (i) elicits not only item ratings but also preferences on attributes such as genres to better cluster users, and (ii) uses item pairs instead of single items at each node to learn user preferences more effectively. Experimental results demonstrate that both proposed enhancements lead to improved performance, particularly with a reduced number of queries.

Paper
