1,997 result(s) for "Training sets"
Selecting training sets for support vector machines: a review
Support vector machines (SVMs) are supervised classifiers successfully applied in a plethora of real-life applications. However, they suffer from an important shortcoming: their training time and memory complexity grow with the training set size. This issue is especially challenging today, as the amount of data generated every second in many domains is tremendous. This review provides an extensive survey of existing methods for selecting SVM training data from large datasets. We divide the state-of-the-art techniques into several categories, which help clarify the underlying ideas behind these algorithms and may be useful in designing new methods for this important problem. The review is complemented with a discussion of future research pathways that can make SVMs easier to exploit in practice.
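As a minimal illustration of the problem this review surveys (not any specific method from it), the sketch below trains an SVM on a full synthetic training set and on a small random subsample, the simplest selection baseline; all dataset sizes and parameters are invented for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for a large dataset; sizes are arbitrary.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Baseline 1: train on the full set. Baseline 2: train on a random 10% subset.
full = SVC(kernel="rbf").fit(X_tr, y_tr)
idx = np.random.default_rng(0).choice(len(X_tr), size=len(X_tr) // 10, replace=False)
small = SVC(kernel="rbf").fit(X_tr[idx], y_tr[idx])

acc_full, acc_small = full.score(X_te, y_te), small.score(X_te, y_te)
print(f"full: {acc_full:.3f}  10% subset: {acc_small:.3f}")
```

A good training-set selection method aims to close most of the accuracy gap between the two models while keeping the subset's much lower training cost.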
Optimising Genomic Selection in Wheat: Effect of Marker Density, Population Size and Population Structure on Prediction Accuracy
Genomic selection applied to plant breeding enables earlier estimates of a line’s performance and significant reductions in generation interval. Several factors affecting prediction accuracy should be well understood if breeders are to harness genomic selection to its full potential. We used a panel of 10,375 bread wheat (Triticum aestivum) lines genotyped with 18,101 SNP markers to investigate the effect and interaction of training set size, population structure and marker density on genomic prediction accuracy. Through assessing the effect of training set size we showed the rate at which prediction accuracy increases is slower beyond approximately 2,000 lines. The structure of the panel was assessed via principal component analysis and K-means clustering, and its effect on prediction accuracy was examined through a novel cross-validation analysis according to the K-means clusters and breeding cohorts. Here we showed that accuracy can be improved by increasing the diversity within the training set, particularly when relatedness between training and validation sets is low. The breeding cohort analysis revealed that traits with higher selection pressure (lower allelic diversity) can be more accurately predicted by including several previous cohorts in the training set. The effect of marker density and its interaction with population structure was assessed for marker subsets containing between 100 and 17,181 markers. This analysis showed that response to increased marker density is largest when using a diverse training set to predict between poorly related material. These findings represent a significant resource for plant breeders and contribute to the collective knowledge on the optimal structure of calibration panels for genomic prediction.
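To make the training-set-size effect concrete, here is a toy genomic-prediction sketch in the same spirit, using ridge regression as a stand-in for GBLUP; the marker counts, effect sizes, and noise level are invented and far smaller than the paper's panel.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_lines, n_markers = 1200, 500
M = rng.integers(0, 3, size=(n_lines, n_markers)).astype(float)  # SNP codes 0/1/2
beta = rng.normal(scale=0.1, size=n_markers)                     # true marker effects
pheno = M @ beta + rng.normal(scale=1.0, size=n_lines)           # phenotype = signal + noise

test = slice(1000, None)                                          # held-out validation lines
accs = []
for n_train in (100, 400, 1000):
    model = Ridge(alpha=100.0).fit(M[:n_train], pheno[:n_train])
    accs.append(np.corrcoef(model.predict(M[test]), pheno[test])[0, 1])
print([round(a, 2) for a in accs])  # prediction accuracy rises with training set size
```

Even in this toy setting, the accuracy gain per added training line shrinks as the training set grows, mirroring the diminishing returns the paper reports beyond roughly 2,000 lines.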
A graphical heuristic for reduction and partitioning of large datasets for scalable supervised training
A scalable graphical method is presented for selecting and partitioning datasets for the training phase of a classification task. The heuristic requires a clustering algorithm so that its computation cost stays in reasonable proportion to the training task itself. This step is followed by construction of an information graph of the underlying classification patterns using approximate nearest neighbor methods. The presented method consists of two approaches: one for reducing a given training set, and another for partitioning the selected/reduced set. The heuristic targets large datasets, since the primary goal is a significant reduction in training computation run-time without compromising prediction accuracy. Test results show that both approaches significantly speed up the training task compared with the state-of-the-art shrinking heuristics available in LIBSVM, while closely matching or even outperforming them in prediction accuracy. A network design is also presented for a partitioning-based distributed training formulation, yielding added speed-up in training run-time compared with a serial implementation of the approaches.
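A rough sketch of the clustering step only (the actual heuristic also builds an information graph with approximate nearest neighbors, which is omitted here); the cluster count and data are invented:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Reduce a training set by keeping one representative per cluster:
# the member closest to its cluster centre.
X, y = make_blobs(n_samples=1000, centers=4, cluster_std=2.0, random_state=0)
km = KMeans(n_clusters=50, n_init=10, random_state=0).fit(X)

keep = []
for c, centre in enumerate(km.cluster_centers_):
    members = np.flatnonzero(km.labels_ == c)
    keep.append(members[np.argmin(np.linalg.norm(X[members] - centre, axis=1))])

X_red, y_red = X[keep], y[keep]
print(X_red.shape)  # 50 representatives instead of 1000 points
```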
An Empirical Analysis of Data Requirements for Financial Forecasting with Neural Networks
Neural networks have been shown to be a promising tool for forecasting financial time series. Several design factors significantly impact the accuracy of neural network forecasts, including the selection of input variables, the architecture of the network, and the quantity of training data. The questions of input variable selection and system architecture design have been widely researched, but the corresponding question of how much data to use in producing high-quality neural network models has not been adequately addressed. In this paper, the effects of training sample sets of different sizes on forecasting currency exchange rates are examined. It is shown that neural networks, given an appropriate amount of historical data, can forecast future currency exchange rates with 60 percent accuracy, while networks trained on a larger training set show worse forecasting performance. In addition to producing higher-quality forecasts, the reduced training set sizes lower development cost and time.
High-Performance Large-Scale SVM-based Multiclass Classification
A typical characteristic of modern applied multiclass classification problems is their large scale, which significantly complicates, or even rules out, the application of such a popular, convenient, and well-interpreted method as Support Vector Machines (SVM), a method well proven on small classification problems. The practical problem is therefore to increase the computational performance of SVM. The Double-Layer Smart Sampling SVM (DLSS-SVM) method reduces the training time of multiclass SVM by applying the smart sampling technique twice. This paper proposes a high-performance version of DLSS-SVM (HP-DLSS-SVM) based on a two-level parallel computing scheme, which more fully exploits useful DLSS-SVM properties and the capabilities of the computing system. An experimental investigation of the proposed HP-DLSS-SVM method was performed on three large handwritten digit image datasets of different sizes. The experiments show that the proposed approach substantially decreases training and testing times while keeping recognition accuracy close to the best obtained.
Desirable and undesirable difficulties: Influences of variability, training schedule, and aptitude on nonnative phonetic learning
Adult listeners often struggle to learn to distinguish speech sounds not present in their native language. High-variability training sets (i.e., stimuli produced by multiple talkers or stimuli that occur in diverse phonological contexts) often result in better retention of the learned information, as well as increased generalization to new instances. However, high-variability training is also more challenging, and not every listener can take advantage of this kind of training. An open question is how variability should be introduced to the learner in order to capitalize on the benefits of such training without derailing the training process. The current study manipulated phonological variability as native English speakers learned a difficult nonnative (Hindi) contrast by presenting the nonnative contrast in the context of two different vowels (/i/ and /u/). In a between-subjects design, variability was manipulated during training and during test. Participants were trained in the evening hours and returned the next morning for reassessment to test for retention of the speech sounds. We found that blocked training was superior to interleaved training for both learning and retention, but for learners in the interleaved training group, higher pretraining aptitude predicted better identification performance. Further, pretraining discrimination aptitude positively predicted changes in phonetic discrimination after a period of off-line consolidation, regardless of the training manipulation. These findings add to a growing literature suggesting that variability may come at a cost in phonetic learning and that aptitude can affect both learning and retention of nonnative speech sounds.
Text data augmentation for large language models: a comprehensive survey of methods, challenges, and opportunities
The increasing size and complexity of pre-trained language models have demonstrated superior performance in many applications, but they usually require large training datasets to be adequately trained. Insufficient training sets could unexpectedly lead the model to overfit and fail to cope with complex tasks. Large language models (LLMs) trained on extensive corpora have prominent text generation capabilities, which improve the quality and quantity of data and play a crucial role in data augmentation. Specifically, distinctive prompt templates are given in personalised tasks to guide LLMs in generating the required content. Recently, promising retrieval-based techniques have further enhanced the expressive performance of LLMs in data augmentation by introducing external knowledge, enabling them to produce more grounded data. This survey provides an in-depth analysis of data augmentation in LLMs, classifying the techniques into Simple Augmentation, Prompt-based Augmentation, Retrieval-based Augmentation, and Hybrid Augmentation. Additionally, we conduct extensive experiments across four techniques, systematically compare and analyse their performance, and provide key insights. Following this, we connect data augmentation with three critical optimisation techniques. Finally, we introduce existing challenges and future opportunities that could further improve data augmentation. This survey provides researchers and practitioners of the text modality with avenues to address data scarcity and improve data quality, helping scholars understand the evolution of text data augmentation from traditional methods to the application of human-like generation and agent search in the era of LLMs.
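As a small illustration of prompt-based augmentation, the sketch below only constructs the prompts that would be sent to an LLM; the template wording and seed examples are hypothetical, and the model call itself is deliberately left out (no real API is assumed).

```python
# Build augmentation prompts from a labelled seed set; no LLM API is assumed here.
TEMPLATE = (
    "Rewrite the following {label} product review in different words, "
    "keeping its sentiment and meaning:\n\n{text}"
)

seed_data = [
    ("positive", "The battery easily lasts all day."),
    ("negative", "The screen cracked within a week."),
]

prompts = [TEMPLATE.format(label=label, text=text) for label, text in seed_data]
print(prompts[0])
```

Each generated rewrite would then be added to the training set under its seed example's label, which is the core loop behind the survey's Prompt-based Augmentation category.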
Genome Properties and Prospects of Genomic Prediction of Hybrid Performance in a Breeding Program of Maize
Maize (Zea mays L.) serves as model plant for heterosis research and is the crop where hybrid breeding was pioneered. We analyzed genomic and phenotypic data of 1254 hybrids of a typical maize hybrid breeding program based on the important Dent × Flint heterotic pattern. Our main objectives were to investigate genome properties of the parental lines (e.g., allele frequencies, linkage disequilibrium, and phases) and examine the prospects of genomic prediction of hybrid performance. We found high consistency of linkage phases and large differences in allele frequencies between the Dent and Flint heterotic groups in pericentromeric regions. These results can be explained by the Hill–Robertson effect and support the hypothesis of differential fixation of alleles due to pseudo-overdominance in these regions. In pericentromeric regions we also found indications for consistent marker–QTL linkage between heterotic groups. With prediction methods GBLUP and BayesB, the cross-validation prediction accuracy ranged from 0.75 to 0.92 for grain yield and from 0.59 to 0.95 for grain moisture. The prediction accuracy of untested hybrids was highest, if both parents were parents of other hybrids in the training set, and lowest, if none of them were involved in any training set hybrid. Optimizing the composition of the training set in terms of number of lines and hybrids per line could further increase prediction accuracy. We conclude that genomic prediction facilitates a paradigm shift in hybrid breeding by focusing on the performance of experimental hybrids rather than the performance of parental lines in testcrosses.
Selecting More Informative Training Sets with Fewer Observations
A standard text-as-data workflow in the social sciences involves identifying a set of documents to be labeled, selecting a random sample of them to label using research assistants, training a supervised learner to label the remaining documents, and validating that model’s performance using standard accuracy metrics. The most resource-intensive component of this is the hand-labeling: carefully reading documents, training research assistants, and paying human coders to label documents in duplicate or more. We show that hand-coding an algorithmically selected rather than a simple-random sample can improve model performance above baseline by as much as 50%, or reduce hand-coding costs by up to two-thirds, in applications predicting (1) U.S. executive-order significance and (2) financial sentiment on social media. We accompany this manuscript with open-source software to implement these tools, which we hope can make supervised learning cheaper and more accessible to researchers.
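One common family of "algorithmic selection" is uncertainty sampling from active learning; the sketch below uses it as a hypothetical stand-in (the paper's actual selection procedure may differ), and all sizes are invented.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1500, n_features=10, random_state=1)
pool = np.arange(1000)                       # indices of the unlabelled pool
X_te, y_te = X[1000:], y[1000:]              # held-out evaluation documents

rng = np.random.default_rng(1)
seed = rng.choice(pool, size=50, replace=False)           # small hand-coded seed
clf = LogisticRegression(max_iter=1000).fit(X[seed], y[seed])

# "Hand-label" next the pool documents the current model is least sure about.
proba = clf.predict_proba(X[pool])[:, 1]
uncertain = pool[np.argsort(np.abs(proba - 0.5))[:100]]
chosen = np.union1d(seed, uncertain)

active = LogisticRegression(max_iter=1000).fit(X[chosen], y[chosen])
rand_idx = rng.choice(pool, size=len(chosen), replace=False)
baseline = LogisticRegression(max_iter=1000).fit(X[rand_idx], y[rand_idx])
print(f"active: {active.score(X_te, y_te):.3f}  random: {baseline.score(X_te, y_te):.3f}")
```

The comparison of interest is whether the uncertainty-selected budget beats a simple-random budget of the same size, which is the quantity the paper's 50% improvement claim measures.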
Early Warning and Prediction of Scarlet Fever in China Using the Baidu Search Index and Autoregressive Integrated Moving Average With Explanatory Variable (ARIMAX) Model: Time Series Analysis
Internet-derived data and the autoregressive integrated moving average (ARIMA) and ARIMA with explanatory variable (ARIMAX) models are extensively used for infectious disease surveillance. However, the effectiveness of the Baidu search index (BSI) in predicting the incidence of scarlet fever remains uncertain. Our objective was to investigate whether a low-cost BSI monitoring system could function as a valuable complement to traditional scarlet fever surveillance in China. ARIMA and ARIMAX models were developed to predict the incidence of scarlet fever in China using data from the National Health Commission of the People's Republic of China between January 2011 and August 2022. The procedure included establishing a keyword database, selecting and filtering keywords through Spearman rank correlation and cross-correlation analyses, constructing the scarlet fever comprehensive search index (CSI), modeling with the training sets, predicting with the testing sets, and comparing prediction performance. The average monthly incidence of scarlet fever was 4462.17 (SD 3011.75) cases, and annual incidence exhibited an upward trend until 2019. The keyword database contained 52 keywords, but only the 6 most relevant were selected for modeling. A high Spearman rank correlation was observed between reported scarlet fever cases and the scarlet fever CSI (r_s=0.881). We developed the ARIMA(4,0,0)(0,1,2)₁₂ model, and the ARIMA(4,0,0)(0,1,2)₁₂ + CSI (Lag=0) and ARIMAX(1,0,2)(2,0,0)₁₂ models were combined with the BSI. All three models fitted well and passed the Ljung-Box test on the residuals.
The ARIMA(4,0,0)(0,1,2)₁₂, ARIMA(4,0,0)(0,1,2)₁₂ + CSI (Lag=0), and ARIMAX(1,0,2)(2,0,0)₁₂ models demonstrated favorable predictive capability, with mean absolute errors of 1692.16 (95% CI 584.88-2799.44), 1067.89 (95% CI 402.02-1733.76), and 639.75 (95% CI 188.12-1091.38), respectively; root mean squared errors of 2036.92 (95% CI 929.64-3144.20), 1224.92 (95% CI 559.04-1890.79), and 830.80 (95% CI 379.17-1282.43), respectively; and mean absolute percentage errors of 4.33% (95% CI 0.54%-8.13%), 3.36% (95% CI −0.24% to 6.96%), and 2.16% (95% CI −0.69% to 5.00%), respectively. The ARIMAX models outperformed the ARIMA models, achieving smaller errors on all three metrics. This study demonstrates that the BSI can be used for the early warning and prediction of scarlet fever, serving as a valuable supplement to traditional surveillance systems.
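To show the ARIMAX idea in miniature (a lagged outcome plus an exogenous search-index covariate), here is a toy least-squares fit of an AR(1)-with-regressor model on simulated data; the coefficients and series are invented, and the full seasonal ARIMA machinery is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 120
search_index = rng.normal(size=T)            # exogenous covariate (e.g., a search index)
cases = np.zeros(T)
for t in range(1, T):                        # true model: AR(1) plus exogenous term
    cases[t] = 0.7 * cases[t - 1] + 0.5 * search_index[t] + rng.normal(scale=0.1)

# Design matrix: [lagged cases, current search index, intercept].
D = np.column_stack([cases[:-1], search_index[1:], np.ones(T - 1)])
coef, *_ = np.linalg.lstsq(D, cases[1:], rcond=None)
print(np.round(coef, 2))  # estimates should land near the true (0.7, 0.5, 0.0)
```

The exogenous column is what distinguishes ARIMAX from plain ARIMA: if the search index genuinely leads the case counts, including it shrinks the forecast errors, which is the pattern the study reports.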