Catalogue Search | MBRL

Chi-Square Target Encoding for Categorical Data Representation: A Real-World Sensor Data Case Study

by Savarimuthu, Nickolas; Anitha, M.; Bhanu, S. Mary Saira in Accuracy, Advances in Intelligent and Secured Protocols towards Mobile and IoT Paradigms, Algorithms

2025

Feature engineering is critical for improving machine learning (ML) performance, especially when handling categorical data. Traditional encoding methods, such as one-hot and label encoding, often result in challenges like high dimensionality and loss of category significance. To address these limitations, this study introduces a novel Chi-Square Target Encoding (CSTE) approach, which transforms categorical variables into numerical representations by leveraging the chi-square statistic to evaluate the association between categories and the target variable, preserving information effectively. Unlike conventional techniques, CSTE uses Chi-square test values to generate target-based encoded representations, ensuring reliable transformation without information loss. Comprehensive empirical evaluations were conducted on various datasets for binary classification, showcasing CSTE’s superiority over methods like regularized target encoding, basic target encoding, and fuzzification. A case study on real-world sensor data further validated its efficiency and scalability for large-scale data-driven applications. The proposed CSTE method achieved an average increment rate of 3.9447%, surpassing one-hot and fuzzification (1.0430%) and regularized target encoding (1.8308%). Classification outcomes demonstrated F1-scores exceeding 0.90 and AUC values nearing 0.99 across diverse datasets, highlighting its robustness. Furthermore, the reduced dimensionality significantly enhanced inference time while maintaining high accuracy. The CSTE method offers a robust framework for categorical data transformation, addressing limitations of traditional encoding techniques. It improves the interpretability and efficiency of categorical data representation, boosting ML performance. This innovative approach is well-suited for applications across various domains involving categorical data.
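The abstract above does not state the CSTE formula, so the following is only one possible reading of "using chi-square test values as the encoding", not the authors' method. It is a minimal sketch that assumes a binary target and scores each category with the chi-square statistic of its 2x2 table (category vs. rest, positive vs. negative). The function names `chi_square_2x2` and `chi_square_encode` are hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        # A row or column is empty, so no association can be measured.
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator


def chi_square_encode(levels, targets):
    """Map each category to its one-vs-rest chi-square statistic (assumed scheme)."""
    total_pos = sum(targets)
    total_neg = len(targets) - total_pos
    encoding = {}
    for level in set(levels):
        in_pos = sum(1 for l, y in zip(levels, targets) if l == level and y == 1)
        in_neg = sum(1 for l, y in zip(levels, targets) if l == level and y == 0)
        encoding[level] = chi_square_2x2(
            in_pos, in_neg, total_pos - in_pos, total_neg - in_neg
        )
    return encoding


# Toy data: "x" is always positive, "y" is balanced, "z" is always negative.
levels = ["x"] * 4 + ["y"] * 4 + ["z"] * 4
targets = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
encoding = chi_square_encode(levels, targets)
```

In this toy data the unsigned statistic gives "x" and "z" the same value even though they point in opposite directions, so a practical scheme would need some additional step to separate them. How the published method handles this is not stated in the abstract.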

Journal Article

Advancing Sustainable Learning Environments: A Literature Review on Data Encoding Techniques for Student Performance Prediction using Deep Learning Models in Education

by Khoulji, Samira; Laarbi Kerkeb, Mohammed; Ouahi, Mariame in Algorithms, categorical data encoding, Coding

2024

The utilization of neural model techniques for predicting learner performance has exhibited success across various technical domains, including natural language processing. In recent times, researchers have progressively directed their attention towards employing these methods to contribute to socioeconomic sustainability, particularly in the context of forecasting student academic performance. Additionally, educational data frequently encompass numerous categorical variables, and the efficacy of prediction models becomes intricately tied to sustainable encoding techniques applied to manage and interpret this data. This approach aligns with the broader goal of fostering sustainable development in education, emphasizing responsible and equitable practices in leveraging advanced technologies for enhanced learning outcomes. Building on this insight, this paper presents a literature review that delves into the use of machine learning techniques for predicting learner outcomes in online training courses. The objective is to offer a summary of the most recent models designed for forecasting student performance, categorical coding methodologies, and the datasets employed. The research conducts experiments to assess the suggested models both against each other and in comparison to certain prediction techniques utilizing alternative machine learning algorithms concurrently. The findings suggest that employing the encoding technique for transforming categorical data enhances the effectiveness of deep learning architectures. Notably, when integrated with long short-term memory networks, this strategy yields exceptional results for the examined issue.

Journal Article

CatBoost for big data: an interdisciplinary review

by Hancock, John T.; Khoshgoftaar, Taghi M. in Algorithms, Best practice, Big Data

2020

Gradient Boosted Decision Trees (GBDTs) are a powerful tool for classification and regression tasks in Big Data. Researchers should be familiar with the strengths and weaknesses of current implementations of GBDTs in order to use them effectively and make successful contributions. CatBoost is a member of the family of GBDT machine learning ensemble techniques. Since its debut in late 2018, researchers have successfully used CatBoost for machine learning studies involving Big Data. We take this opportunity to review recent research on CatBoost as it relates to Big Data, and learn best practices from studies that cast CatBoost in a positive light, as well as studies where CatBoost does not outshine other techniques, since we can learn lessons from both types of scenarios. Furthermore, as a Decision Tree based algorithm, CatBoost is well-suited to machine learning tasks involving categorical, heterogeneous data. Recent work across multiple disciplines illustrates CatBoost’s effectiveness and shortcomings in classification and regression tasks. Another important issue we expose in the literature on CatBoost is its sensitivity to hyper-parameters and the importance of hyper-parameter tuning. One contribution we make is to take an interdisciplinary approach to cover studies related to CatBoost in a single work. This provides researchers with an in-depth understanding to help clarify the proper application of CatBoost in solving problems. To the best of our knowledge, this is the first survey that studies all works related to CatBoost in a single publication.

Journal Article

Regularized target encoding outperforms traditional methods in supervised machine learning with high cardinality features

by Pargent, Florian; Pfisterer, Florian; Bischl, Bernd in Algorithms, Best practice, Data analysis

2022

Since most machine learning (ML) algorithms are designed for numerical inputs, efficiently encoding categorical variables is a crucial aspect of data analysis. A common problem is high-cardinality features, i.e. unordered categorical predictor variables with a high number of levels. We study techniques that yield numeric representations of categorical variables which can then be used in subsequent ML applications. We focus on the impact of these techniques on a subsequent algorithm’s predictive performance and, if possible, derive best practices on when to use which technique. We conducted a large-scale benchmark experiment, where we compared different encoding strategies together with five ML algorithms (lasso, random forest, gradient boosting, k-nearest neighbors, support vector machine) using datasets from regression, binary classification, and multiclass classification settings. In our study, regularized versions of target encoding (i.e. using target predictions based on the feature levels in the training set as a new numerical feature) consistently provided the best results. Traditionally widely used encodings that make unreasonable assumptions to map levels to integers (e.g. integer encoding) or to reduce the number of levels (possibly based on target information, e.g. leaf encoding) before creating binary indicator variables (one-hot or dummy encoding) were not as effective in comparison.
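As an illustration of what "regularized target encoding" can look like, here is a minimal sketch that shrinks each level's target mean toward the global mean (additive smoothing). This is one common form of regularization and is not necessarily a variant benchmarked in the study above. The helper names and the `smoothing` parameter are hypothetical.

```python
from collections import defaultdict


def fit_smoothed_target_encoding(levels, targets, smoothing=10.0):
    """Return (level -> shrunken target mean, global mean) from training data."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for level, y in zip(levels, targets):
        sums[level] += y
        counts[level] += 1
    # Rare levels are pulled strongly toward the global mean; frequent
    # levels stay close to their own observed mean.
    encoding = {
        level: (sums[level] + smoothing * global_mean) / (counts[level] + smoothing)
        for level in counts
    }
    return encoding, global_mean


def transform(levels, encoding, default):
    """Encode levels, falling back to `default` for levels unseen in training."""
    return [encoding.get(level, default) for level in levels]


train_levels = ["a", "a", "a", "b", "c", "c"]
train_targets = [1, 1, 0, 1, 0, 0]
encoding, prior = fit_smoothed_target_encoding(
    train_levels, train_targets, smoothing=2.0
)
encoded = transform(["a", "b", "d"], encoding, default=prior)
```

The encoding is fitted on training data only and then applied to new rows. Fitting it on rows that are later used for evaluation would leak target information into the features.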

Journal Article

Perceptual warping exposes categorical representations for speech in human brainstem responses

by Bidelman, Gavin M.; Carter, Jared A. in Acoustic phonetics, Acoustic Stimulation, Acoustics

2023

• Measured brainstem FFRs during online speech categorization.
• Speech-FFRs were enhanced in active vs. passive listening.
• FFR speech representations were warped according to listeners’ phonetic label.
• Subcortical activity carries a perceptual code and is actively modulated in a top-down manner during speech perception.

The brain transforms continuous acoustic events into discrete category representations to downsample the speech signal for our perceptual-cognitive systems. Such phonetic categories are highly malleable, and their percepts can change depending on surrounding stimulus context. Previous work suggests this acoustic-phonetic mapping and perceptual warping of speech emerge in the brain no earlier than auditory cortex. Here, we examined whether these auditory-category phenomena inherent to speech perception occur even earlier in the human brain, at the level of auditory brainstem. We recorded speech-evoked frequency following responses (FFRs) during a task designed to induce more/less warping of listeners’ perceptual categories depending on stimulus presentation order of a speech continuum (random, forward, backward directions). We used a novel clustered stimulus paradigm to rapidly record the high trial counts needed for FFRs concurrent with active behavioral tasks. We found serial stimulus order caused perceptual shifts (hysteresis) near listeners’ category boundary, confirming identical speech tokens are perceived differentially depending on stimulus context. Critically, we further show neural FFRs during active (but not passive) listening are enhanced for prototypical vs. category-ambiguous tokens and are biased in the direction of listeners’ phonetic label even for acoustically-identical speech stimuli. These findings were not observed in the stimulus acoustics nor model FFR responses generated via a computational model of cochlear and auditory nerve transduction, confirming a central origin to the effects. Our data reveal FFRs carry category-level information and suggest top-down processing actively shapes the neural encoding and categorization of speech at subcortical levels. These findings suggest the acoustic-phonetic mapping and perceptual warping in speech perception occur surprisingly early along the auditory neuroaxis, which might aid understanding by reducing ambiguity inherent to the speech signal.

Journal Article

Neural network approach enhancing churn prediction with categorical encoding and standard scaling

by Samantaray, Subham Pankaj; Bhadra, Somasree; Madhu, Utpal in 639/166, 639/705, Accuracy

2026

Customer churn prediction is a crucial application of machine learning in business analytics. This article presents a controlled benchmarking of a multilayer perceptron model trained on 10,000 customer records with 12 features. Data pre-processing was performed using one-hot encoding and standard scaling to improve model generalisation. The model achieved an ROC AUC of 0.8640 with a recall of 0.4178 and an F1 score of 0.5534. Precision, the share of correct predictions among cases labelled positive, was relatively high at 0.8214 ± 0.0253. The moderately low false-positive rate indicates that the model rarely misclassifies non-churners as churners, which is very important for cost-efficient customer retention programs. A group-level heterogeneity review showed that model performance, measured by Brier scores, was best for France (0.076439), followed by Spain (0.077394), and worst for Germany (0.106267). The model identified the 7 most influential fields, with permutation scores ranging from 0.000147 to 0.120842. The calibration test helps identify underconfident and overconfident model performance through quantile bins. The performance is evaluated using a bin range from 0.12 to 0.25, with a trustworthy prediction for low- to mid-risk customers. A comparative analysis with baseline machine learning models showed that the Gradient Boosting algorithm obtained the best overall accuracy (0.8720) and balanced F1-score (0.6098), while the proposed Neural Network with categorical encoding and standard scaling attained the highest precision (0.8528), effectively minimising false positives in churn detection. Although the PR_AUC was slightly lower (0.7140), the model had better recall (0.4128) and an equal ROC_AUC (0.8657), indicating the proposed model’s strength and ability to operate on complex, high-dimensional churn data.
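The two pre-processing steps named in this abstract, one-hot encoding and standard scaling, can be sketched in plain Python as below. This illustrates the general techniques only and is not the article's pipeline. The `balances` column and its values are invented for the example, and the country names are borrowed from the abstract.

```python
from statistics import mean, pstdev


def one_hot(values):
    """Return (sorted category list, one indicator row per input value)."""
    categories = sorted(set(values))
    rows = [[1.0 if v == c else 0.0 for c in categories] for v in values]
    return categories, rows


def standard_scale(values):
    """Centre to mean 0 and scale to unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


countries = ["France", "Spain", "Germany", "France"]
balances = [0.0, 100.0, 200.0, 100.0]  # hypothetical numeric feature

categories, country_columns = one_hot(countries)
scaled_balances = standard_scale(balances)

# Final feature rows: indicator columns followed by the scaled numeric column.
rows = [cols + [b] for cols, b in zip(country_columns, scaled_balances)]
```

In practice the category list, mean, and standard deviation would be computed on the training split and reused unchanged on validation and test data.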

Journal Article

Out of (the) bag—encoding categorical predictors impacts out-of-bag samples

by Marshall, Jonathan C.; French, Nigel P.; Smith, Helen L. in Absent levels, Accuracy, Automatic classification

2024

Performance of random forest classification models is often assessed and interpreted using out-of-bag (OOB) samples. Observations which are OOB when a tree is trained may serve as a test set for that tree, and predictions from the OOB observations are used to calculate OOB error and variable importance measures (VIM). OOB errors are popular because they are fast to compute and, for large samples, are a good estimate of the true prediction error. In this study, we investigate how target-based vs. target-agnostic encoding of categorical predictor variables for random forest can bias performance measures based on OOB samples. We show that, when categorical variables are encoded using a target-based encoding method, and when the encoding takes place prior to bagging, the OOB sample can underestimate the true misclassification rate, and overestimate variable importance. We recommend using a separate test data set when evaluating variable importance and/or predictive performance of tree based methods that utilise a target-based encoding method.

Journal Article

Orthogonal Matrix-Autoencoder-Based Encoding Method for Unordered Multi-Categorical Variables with Application to Neural Network Target Prediction Problems

by Wang, Yiying; Yang, Boxin; Li, Jinghua in autoencoder, Bayesian optimization, Data processing

2024

Neural network models, such as BP, LSTM, etc., support only numerical inputs, so data preprocessing needs to be carried out on the categorical variables to convert them into numerical data. For unordered multi-categorical variables, existing encoding methods may produce dimensional catastrophes and may also introduce additional order misrepresentation and distance bias in neural network computation. To solve the above problems, this paper proposes O-AE, an encoding method for unordered multi-categorical variables that uses an orthogonal matrix for the initial encoding and an autoencoder for representation learning and dimensionality reduction. Bayesian optimization is used for hyperparameter optimization of the autoencoder. Finally, seven experiments were designed with the basic O-AE, O-AE with Bayesian optimization of the autoencoder hyperparameters (O-AE-b), and other encoding methods to encode unordered multi-categorical variables in five datasets, and the encoded variables were input into a BP neural network to carry out target prediction experiments. The results show that the experiments using O-AE and O-AE-b have better prediction results, indicating that the method proposed in this paper is feasible and applicable and can be an optional method for the data processing of unordered multi-categorical variables.

Journal Article

Encoding a Categorical Independent Variable for Input to TerrSet’s Multi-Layer Perceptron

by Evenden, Emily; Pontius Jr, Robert Gilmore in Algorithms, categorical variable, Continuity (mathematics)

2021

The profession debates how to encode a categorical variable for input to machine learning algorithms, such as neural networks. A conventional approach is to convert a categorical variable into a collection of binary variables, which causes a burdensome number of correlated variables. TerrSet’s Land Change Modeler proposes encoding a categorical variable onto the continuous closed interval from 0 to 1 based on each category’s Population Evidence Likelihood (PEL) for input to the Multi-Layer Perceptron, which is a type of neural network. We designed examples to test the wisdom of these encodings. The results show that encoding a categorical variable based on each category’s Sample Empirical Probability (SEP) produces results similar to binary encoding and superior to PEL encoding. The Multi-Layer Perceptron’s sigmoidal smoothing function can cause PEL encoding to produce nonsensical results, while SEP encoding produces straightforward results. We reveal the encoding methods by illustrating how a dependent variable gains across an independent variable that has four categories. The results show that PEL can differ substantially from SEP in ways that have important implications for practical extrapolations. If users must encode a categorical variable for input to a neural network, then we recommend SEP encoding, because SEP efficiently produces outputs that make sense.

Journal Article

Optimize the Combination of Categorical Variable Encoding and Deep Learning Technique for the Problem of Prediction of Vietnamese Student Academic Performance

by Hien, Do Thi Thu; Kim, Tran; The, Dao in Academic achievement, Algorithms, Artificial neural networks

2020

Deep learning techniques have been successfully applied in many technical fields such as computer vision and natural language processing, and researchers have recently paid much attention to applying this technology to socio-economic problems, including the student academic performance prediction (SAPP) problem. In this context, this study focuses on both designing an appropriate deep learning model and handling categorical input variables. Categorical variables are common in the SAPP problem, whereas deep learning techniques in particular, and artificial neural networks in general, work well only with numerical variables. Therefore, this study investigates the performance of combinations of categorical encoding methods (label encoding, one-hot encoding and “learned” embedding encoding) with deep learning techniques (a deep dense neural network and a long short-term memory neural network) for the SAPP problem. In the experiments, the proposed models were compared with each other and with some prediction methods based on other machine learning algorithms. The results showed that categorical data transformation using the “learned” embedding encoding improved the performance of the deep learning models, and its combination with the long short-term memory network gave an outstanding result for the researched problem.

Journal Article
