Catalogue Search | MBRL

How to Classify, Detect, and Manage Univariate and Multivariate Outliers, With Emphasis on Pre-Registration

by Ley, Christophe , Mora, Youri L. , Leys, Christophe in Data analysis , Datasets , Hypotheses

2019

Researchers often lack knowledge about how to deal with outliers when analyzing their data. Even more frequently, researchers do not pre-specify how they plan to manage outliers. In this paper we aim to improve research practices by outlining what you need to know about outliers. We start by providing a functional definition of outliers. We then lay down an appropriate nomenclature/classification of outliers. This nomenclature is used to understand what kinds of outliers can be encountered and serves as a guideline to make appropriate decisions regarding the conservation, deletion, or recoding of outliers. These decisions might impact the validity of statistical inferences as well as the reproducibility of our experiments. To be able to make informed decisions about outliers you first need proper detection tools. We remind readers why the most common outlier detection methods are problematic and recommend the use of the median absolute deviation to detect univariate outliers, and of the Mahalanobis-MCD distance to detect multivariate outliers. An R package was created that can be used to easily perform these detection tests. Finally, we promote the use of pre-registration to avoid flexibility in data analysis when handling outliers. Publisher's note: due to a typesetting error, this paper was originally published with incorrect table numbering, where tables 2, 3, and 4 were incorrectly labelled. This was corrected soon after publication.
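As a rough illustration of the univariate rule mentioned in this abstract, the sketch below flags values that lie far from the median in units of the median absolute deviation (MAD). It is not the authors' R package: the function name `mad_outliers`, the cutoff of 2.5, and the sample data are assumptions made for this example. The constant 1.4826 is the usual scaling that makes the MAD comparable to a standard deviation under normality.

```python
import statistics


def mad_outliers(values, threshold=2.5):
    """Return the values whose scaled distance from the median exceeds threshold.

    threshold=2.5 is an assumed cutoff for illustration only.
    """
    med = statistics.median(values)
    # Scaled MAD: median of absolute deviations from the median, times 1.4826.
    mad = 1.4826 * statistics.median(abs(v - med) for v in values)
    if mad == 0:
        # No spread around the median, so the rule cannot discriminate.
        return []
    return [v for v in values if abs(v - med) / mad > threshold]


# Hypothetical measurements with one suspicious value.
data = [2.1, 2.4, 2.2, 2.6, 2.3, 2.5, 9.8]
flagged = mad_outliers(data)
print(flagged)
```

Because both the centre and the spread are medians, a single extreme value barely moves the cutoff, which is the usual argument for preferring this rule to one based on the mean and standard deviation.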

Journal Article

Machine learning methods in sport injury prediction and prevention: a systematic review

by Van Eetvelde, Hans , Ley, Christophe , Tischer, Thomas in Algorithms , Feature selection , Injury prediction

2021

Purpose: Injuries are common in sports and can have significant physical, psychological and financial consequences. Machine learning (ML) methods could be used to improve injury prediction and allow proper approaches to injury prevention. The aim of our study was therefore to perform a systematic review of ML methods in sport injury prediction and prevention. Methods: A search of the PubMed database was performed on March 24th 2020. Eligible articles included original studies investigating the role of ML for sport injury prediction and prevention. Two independent reviewers screened articles, assessed eligibility, risk of bias and extracted data. Methodological quality and risk of bias were determined by the Newcastle–Ottawa Scale. Study quality was evaluated using the GRADE working group methodology. Results: Eleven out of 249 studies met inclusion/exclusion criteria. Different ML methods were used (tree-based ensemble methods (n = 9), Support Vector Machines (n = 4), Artificial Neural Networks (n = 2)). The classification methods were facilitated by preprocessing steps (n = 5) and optimized using over- and undersampling methods (n = 6), hyperparameter tuning (n = 4), feature selection (n = 3) and dimensionality reduction (n = 1). Injury predictive performance ranged from poor (Accuracy = 52%, AUC = 0.52) to strong (AUC = 0.87, f1-score = 85%). Conclusions: Current ML methods can be used to identify athletes at high injury risk and be helpful to detect the most important injury risk factors. Methodological quality of the analyses was sufficient in general, but could be further improved. More effort should be put in the interpretation of the ML models.

Journal Article

TailCoR: A new and simple metric for tail correlations that disentangles the linear and nonlinear dependencies that cause extreme co-movements

by Ley, Christophe , Ricci, Lorenzo , Veredas, David in Analysis , Coronaviruses , COVID-19

2023

Economic and financial crises are characterised by unusually large events. These tail events co-move because of linear and/or nonlinear dependencies. We introduce TailCoR, a metric that combines (and disentangles) these linear and non-linear dependencies. TailCoR between two variables is based on the tail inter quantile range of a simple projection. It is dimension-free, and, unlike competing metrics, it performs well in small samples and no optimisations are needed. Indeed, TailCoR requires a few lines of coding and it is very fast. A Monte Carlo analysis confirms the goodness of the metric, which is illustrated on a sample of 21 daily financial market indexes across the globe and for 20 years. The estimated TailCoRs are in line with the financial and economic events, such as the 2008 great financial crisis and the 2020 pandemic.

Journal Article

Boosting any learning algorithm with Statistically Enhanced Learning

by Ley, Christophe , Bordas, Stéphane P. A. , Felice, Florian in 639/705/1041 , 639/705/117 , 639/705/531

2025

Feature engineering is of critical importance in the field of Data Science. While any data scientist knows the importance of rigorously preparing data to obtain good performing models, only scarce literature formalizes its benefits. In this work, we present the method of Statistically Enhanced Learning (SEL), a formalization framework of existing feature engineering and extraction tasks in Machine Learning (ML). Contrary to existing approaches, predictors are not directly observed but obtained as statistical estimators. Our goal is to study SEL, aiming to establish a formalized framework and illustrate its improved performance by means of simulations as well as applications on practical use cases.

Journal Article

Outdoor Walking Classification Based on Inertial Measurement Unit and Foot Pressure Sensor Data

by Ley, Christophe , Emmerzaal, Jill , Garcia, Frederic in Algorithms , Artificial intelligence , Biomechanics

2025

(1) Background: Navigating surfaces during walking can alter gait patterns. This study aims to develop tools for automatic walking condition classification using inertial measurement unit (IMU) and foot pressure sensors. We compared sensor modalities (IMUs on lower-limbs, IMUs on feet, IMUs on the pelvis, pressure insoles, and IMUs on the feet or pelvis combined with pressure insoles) and evaluated whether gait cycle segmentation improves performance compared to a sliding window. (2) Methods: Twenty participants performed flat, stairs up, stairs down, slope up, and slope down walking trials while fitted with IMUs and pressure insoles. Machine learning (ML; Extreme Gradient Boosting) and deep learning (DL; Convolutional Neural Network + Long Short-Term Memory) models were trained to classify these conditions. (3) Results: Overall, a DL model using lower-limb IMUs processed with gait segmentation performed the best (F1=0.89). Models trained with IMUs outperformed those trained on pressure insoles (p<0.01). Combining sensor modalities and gait segmentation improved performance for ML models (p<0.01). The best minimal model was a DL model trained on IMU pelvis + pressure insole data using sliding window segmentation (F1=0.83). (4) Conclusions: IMUs provide the most discriminative features for automatic walking condition classification. Combining sensor modalities may be helpful for some model architectures. DL models perform well without gait segmentation, making them independent of gait event identification algorithms.

Journal Article

A new generator for proposing flexible lifetime distributions and its properties

by Ley, Christophe , Shah, Said Farooq , Asghar, Zahid in Analysis , Censorship , Computer simulation

2020

In this paper, we develop a generator to propose new continuous lifetime distributions. Thanks to a simple transformation involving one additional parameter, every existing lifetime distribution can be rendered more flexible with our construction. We derive stochastic properties of our models, and explain how to estimate their parameters by means of maximum likelihood for complete and censored data, where we focus, in particular, on Type-II, Type-I and random censoring. A Monte Carlo simulation study reveals that the estimators are consistent. To emphasize the suitability of the proposed generator in practice, the two-parameter Fréchet distribution is taken as baseline distribution. Three real life applications are carried out to check the suitability of our new approach, and it is shown that our extension of the Fréchet distribution outperforms existing extensions available in the literature.

Journal Article

Relationship between a daily injury risk estimation feedback (I-REF) based on machine learning techniques and actual injury risk in athletics (track and field): protocol for a prospective cohort study over an athletics season

by Ley, Christophe , Chapon, Joris , Hollander, Karsten in Algorithms , Artificial Intelligence , Athletes

2023

Introduction: Two-thirds of athletes (65%) have at least one injury complaint leading to participation restriction (ICPR) in athletics (track and field) during one season. The emerging practice of medicine and public health supported by electronic processes and communication in sports medicine represents an opportunity for developing new injury risk reduction strategies. Modelling and predicting the risk of injury in real-time through artificial intelligence using machine learning techniques might represent an innovative injury risk reduction strategy. Thus, the primary aim of this study will be to analyse the relationship between the level of Injury Risk Estimation Feedback (I-REF) use (average score of athletes’ self-declared level of I-REF consideration for their athletics activity) and the ICPR burden during an athletics season. Method and analysis: We will conduct a prospective cohort study, called Injury Prediction with Artificial Intelligence (IPredict-AI), over one 38-week athletics season (from September 2022 to July 2023) involving competitive athletics athletes licensed with the French Federation of Athletics. All athletes will be asked to complete daily questionnaires on their athletics activity, their psychological state, their sleep, the level of I-REF use and any ICPR. I-REF will present a daily estimation of the ICPR risk ranging from 0% (no risk for injury) to 100% (maximal risk for injury) for the following day. All athletes will be free to see I-REF and to adapt their athletics activity according to I-REF. The primary outcome will be the ICPR burden over the follow-up (over an athletics season), defined as the number of days lost from training and/or competition due to ICPR per 1000 hours of athletics activity. The relationship between ICPR burden and the level of I-REF use will be explored by using linear regression models. Ethics and dissemination: This prospective cohort study was reviewed and approved by the Saint-Etienne University Hospital Ethical Committee (Institutional Review Board: IORG0007394, IRBN1062022/CHUSTE). Results of the study will be disseminated in peer-reviewed journals and in international scientific congresses, as well as to the included participants.
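The primary outcome defined in this abstract, days lost per 1000 hours of athletics activity, can be written as a one-line calculation. The sketch below is only an illustration of that definition: the function name `icpr_burden` and the example figures are hypothetical and do not come from the study.

```python
def icpr_burden(days_lost, exposure_hours):
    """Days lost to ICPR per 1000 hours of athletics activity."""
    if exposure_hours <= 0:
        raise ValueError("exposure_hours must be positive")
    return 1000 * days_lost / exposure_hours


# Hypothetical athlete: 12 days lost over 400 hours of activity in a season.
burden = icpr_burden(days_lost=12, exposure_hours=400)
print(burden)
```

Normalising by exposure in this way lets athletes with very different training volumes be compared on the same scale.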

Journal Article

Dynamical SPQEIR model assesses the effectiveness of non-pharmaceutical interventions against COVID-19 epidemic outbreaks

by Gonçalves, Jorge , Ley, Christophe , Proverbio, Daniele in Applications of mathematics , Applied mathematics , Belgium

2021

Against the current COVID-19 pandemic, governments worldwide have devised a variety of non-pharmaceutical interventions to mitigate it. However, it is generally difficult to estimate the joint impact of different control strategies. In this paper, we tackle this question with an extended epidemic SEIR model, informed by a socio-political classification of different interventions. First, we inquire into the conceptual effect of mitigation parameters on the infection curve. Then, we illustrate the potential of our model to reproduce and explain empirical data from a number of countries, to perform cross-country comparisons. This gives information on the best synergies of interventions to control epidemic outbreaks while minimising impact on socio-economic needs. For instance, our results suggest that, while rapid and strong lockdown is an effective pandemic mitigation measure, a combination of social distancing and early contact tracing can achieve similar mitigation synergistically, while keeping lower isolation rates. This quantitative understanding can support the establishment of mid- and long-term interventions, to prepare containment strategies against further outbreaks. This paper also provides an online tool that allows researchers and decision makers to interactively simulate diverse scenarios with our model.

Journal Article

Skew-symmetric distributions and Fisher information: The double sin of the skew-normal

by HALLIN, MARC , LEY, CHRISTOPHE in centred parametrization , consistency rates , Distribution functions

2014

Hallin and Ley [Bernoulli 18 (2012) 747-763] investigate and fully characterize the Fisher singularity phenomenon in univariate and multivariate families of skew-symmetric distributions. This paper proposes a refined analysis of the (univariate) problem, showing that singularity can be more or less severe, inducing n^{1/4} ("simple singularity"), n^{1/6} ("double singularity"), or n^{1/8} ("triple singularity") consistency rates for the skewness parameter. We show, however, that simple singularity (yielding n^{1/4} consistency rates), if any singularity at all, is the rule, in the sense that double and triple singularities are possible for generalized skew-normal families only. We also show that higher-order singularities, leading to worse-than-n^{1/8} rates, cannot occur. Depending on the degree of the singularity, our analysis also suggests a simple reparametrization that offers an alternative to the so-called centred parametrization proposed, in the particular case of skew-normal and skew-t families, by Azzalini [Scand. J. Stat. 12 (1985) 171-178], Arellano-Valle and Azzalini [J. Multivariate Anal. 113 (2013) 73-90], and DiCiccio and Monti [Quaderni di Statistica 13 (2011) 1-21], respectively.

Journal Article

Skew-symmetric distributions and Fisher information — a tale of two densities

by HALLIN, MARC , LEY, CHRISTOPHE in Covariance matrices , Degrees of freedom , Fisher information

2012

Skew-symmetric densities recently received much attention in the literature, giving rise to increasingly general families of univariate and multivariate skewed densities. Most of those families, however, suffer from the inferential drawback of a potentially singular Fisher information in the vicinity of symmetry. All existing results indicate that Gaussian densities (possibly after restriction to some linear subspace) play a special and somewhat intriguing role in that context. We dispel that widespread opinion by providing a full characterization, in a general multivariate context, of the information singularity phenomenon, highlighting its relation to a possible link between symmetric kernels and skewing functions - a link that can be interpreted as the mismatch of two densities.

Journal Article
