Catalogue Search | MBRL

Sparse partial least squares regression for simultaneous dimension reduction and variable selection

by Chun, Hyonho; Keleş, Sündüz in Attention, Biostatistics, Chromatin immuno-precipitation

2010

Partial least squares regression has been an alternative to ordinary least squares for handling multicollinearity in several areas of scientific research since the 1960s. It has recently gained much attention in the analysis of high dimensional genomic data. We show that known asymptotic consistency of the partial least squares estimator for a univariate response does not hold with the very large p and small n paradigm. We derive a similar result for a multivariate response regression with partial least squares. We then propose a sparse partial least squares formulation which aims simultaneously to achieve good predictive performance and variable selection by producing sparse linear combinations of the original predictors. We provide an efficient implementation of sparse partial least squares regression and compare it with well-known variable selection and dimension reduction approaches via simulation experiments. We illustrate the practical utility of sparse partial least squares regression in a joint analysis of gene expression and genomewide binding data.

Journal Article

Estimating Soil Organic Matter Content in Desert Areas Using In Situ Hyperspectral Data and Feature Variable Selection Algorithms in Southern Xinjiang, China

by Hu, Bifeng; Peng, Jie; Luo, Defang in Accuracy, Agricultural land, Algorithms

2022

Soil organic matter (SOM) is a key factor for evaluating soil fertility. Rapidly monitoring organic matter content in desert soil can provide a scientific basis for the rational development and utilization of reserve arable land resources. Although spectral inversion accuracy for SOM under laboratory-controlled conditions is high, it is time-consuming and costly compared to the in situ spectroscopic determination method. However, in situ spectroscopy causes losses in accuracy due to interference from external environmental factors (e.g., the surface roughness of soil, changes in weather conditions, atmospheric water vapor, etc.). Therefore, reducing or removing the interference of external environmental factors to improve the accuracy of in situ spectroscopy for estimating SOM is challenging. In this study, visible and near-infrared (Vis-NIR) in situ spectral data were collected from 135 topsoil (0–20 cm) samples in a desert area of northwestern China, and organic matter content was measured. Three spectral pre-processing methods—the standard normal transform (SNV), reciprocal logarithm (log(1/R)) and normalization (NOR)—combined with three feature variable selection methods—the particle swarm algorithm (PSO), ant colony algorithm (ACO) and simulated annealing (SA) algorithm—were used to filter the spectral feature bands of SOM, and then partial least squares regression (PLSR), a back propagation neural network (BPNN) and a convolutional neural network (CNN) were used to construct the estimation models of SOM. The results indicated that the SNV could enhance the spectral information related to SOM and improve the accuracy of model estimation, and it was one of the most effective spectral pretreatment methods. Compared with the model constructed with the full-band spectroscopy method, the feature variable selection method could effectively improve the estimation accuracy of the Vis-NIR in situ spectroscopy model. 
The most obvious improvement was found with PSO, where R² and RPD were improved by more than 0.34 and 0.16, respectively, and RMSE was reduced by more than 0.29 g kg⁻¹. The accuracy of the CNN model was higher than that of the BPNN and PLSR models, both for the inversion model of SOM built from full-band spectral data and the bands selected by the characteristic variable selection method. SNV-PSO-CNN is the optimal hybrid model for in situ spectral measurement of SOM (R² = 0.71, RPD = 1.88, RMSE = 1.67 g kg⁻¹) and can realize the quantitative in situ spectral inversion of SOM in desert soils.
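As a rough illustration of two quantities named in this abstract, the sketch below applies SNV pre-processing (the abbreviation is usually expanded as standard normal variate) to each spectrum and computes RPD as the standard deviation of the measured values divided by the RMSE. The function names, the sample numbers, and the use of the sample standard deviation (`ddof=1`) are assumptions made for this example, not details taken from the article.

```python
import numpy as np

def snv(spectra):
    """Centre and scale each spectrum (row) by its own mean and standard deviation."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)  # ddof=1 is an assumption
    return (spectra - mean) / std

def rpd(measured, predicted):
    """Ratio of the standard deviation of measured values to the RMSE of prediction."""
    measured = np.asarray(measured, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    rmse = np.sqrt(np.mean((measured - predicted) ** 2))
    return np.std(measured, ddof=1) / rmse

# Illustrative reflectance values only, not data from the study.
spectra = np.array([[0.21, 0.25, 0.30, 0.28],
                    [0.42, 0.50, 0.61, 0.55]])
corrected = snv(spectra)
```

After SNV, every row has zero mean and unit standard deviation, which removes per-sample offset and scale differences before band selection and modelling.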

Journal Article

Improving the Accuracy of Soil Organic Carbon Estimation: CWT-Random Frog-XGBoost as a Prerequisite Technique for In Situ Hyperspectral Analysis

by Yang, Jixiang; Ma, Xiaofei; Li, Xinguo in Accuracy, Agricultural industry, Algorithms

2023

Rapid and accurate measurement of the soil organic carbon (SOC) content is a pre-condition for sustainable grain production and land development, and contributes to carbon neutrality in the agricultural industry. To provide technical support for the development and utilization of land resources, the SOC content can be estimated using Vis-NIR diffuse reflectance spectroscopy. However, the spectral redundancy and co-linearity issues of Vis-NIR spectra pose extreme challenges for spectral analysis and model construction. This study compared the effects of different pre-processing methods and feature variable algorithms on the estimation of the SOC content. To this end, in situ hyperspectral data and soil samples were collected from the lakeside oasis of Bosten Lake in Xinjiang, China. The results showed that the combination of continuous wavelet transform (CWT)-random frog could rapidly estimate the SOC content with excellent estimation accuracy (R² of 0.65–0.86). The feature variable selection algorithm effectively improved the estimation accuracy (average improvement of 0.30–0.48); based on their ability to improve model estimation on average, the algorithms can be ranked as follows: particle swarm optimization (PSO) > ant colony optimization (ACO) > random frog > Boruta > simulated annealing (SA) > successive projections algorithm (SPA). The CWT-XGBoost model based on random frog showed the best results, with R² = 0.86, RMSE = 2.44, and RPD = 2.78. The feature bands accounted for only 0.57% of the Vis-NIR bands, and the most important sensitive bands were distributed at 755–1195 nm, 1602 nm, 1673 nm, and 2213 nm. These findings are of significance for the extraction of precise information on lakeside oases in arid areas, which would aid in achieving human–land sustainability.

Journal Article

Hybrid binary whale with harris hawks for feature selection

by Alqushaibi, Alawi; Abdulkadir, Said Jadid; Al Hussian, Hitham in Accuracy, Algorithms, Artificial Intelligence

2022

A tremendous flow of big data has come from the growing use of digital technology and intelligent systems. This has resulted in an increase in not just the dimensional issues that big data encounters, but also the number of challenges that big data faces, including redundancies and useless features. As a result, feature selection is offered as a method for eliminating unwanted characteristics. This study introduces the BWOAHHO memetic technique, which combines the binary hybrid Whale Optimization Algorithm (WOA) with Harris Hawks Optimization (HHO). A transfer function is used to convert continuous values to binary ones, as feature selection requires. The efficiency of the selected attributes is assessed using a wrapper k-nearest neighbor (KNN) classifier. Eighteen benchmark datasets obtained from the UCI repository were used to measure the proposed method’s proficiency. The performance of the novel hybrid technique was evaluated by comparing it to that of WOA, HHO, Particle Swarm Optimization (PSO), the Genetic Algorithm (GA), and the WOASAT-2. With the new hybrid feature selection method, the WOA algorithm’s efficiency was improved. Classification accuracy, average fitness, average selected attributes, and computational time were all used as performance indicators. In terms of accuracy, the proposed BWOAHHO algorithm was compared with 5 similar metaheuristic algorithms. The BWOAHHO had a classification accuracy of 92% in the 18 datasets, which was higher than BWOA (90%), BPSO (82%), and BGA (82%). 
(83%). The fitness measure of the BWOAHHO algorithm was 0.08, which is lower than the average fitness of BWOA, BPSO, and BGA. In terms of selected attribute size, the proposed BWOAHHO algorithm was compared with the results obtained by the other five techniques. The average selected feature sizes for BWOAHHO, BWOA, BHHO, BGA, and WOASAT-2 were 18.07, 20.12, 22.3, 22.32, 22.40, and 15.99, respectively, and the computing time for the proposed BWOAHHO was 7.36 seconds, which was the lowest computed value. To determine the significance of BWOAHHO, a statistical one-way ANOVA test was used. When compared to existing algorithms, the proposed approach produced better results.
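The abstract mentions a transfer function that maps continuous search positions to a binary feature mask but does not say which one is used. The sketch below assumes an S-shaped (sigmoid) transfer function, a common choice in binary metaheuristics; the function name `binarize` and the sample positions are hypothetical and do not come from the article.

```python
import numpy as np

def binarize(position, rng):
    """Map a continuous position vector to a 0/1 feature mask.

    Assumption: a sigmoid transfer function gives the probability that
    each feature is selected; the article may use a different function.
    """
    position = np.asarray(position, dtype=float)
    prob = 1.0 / (1.0 + np.exp(-position))
    return (rng.random(position.shape) < prob).astype(int)

rng = np.random.default_rng(0)
# Strongly negative positions are almost never selected, strongly positive ones almost always.
mask = binarize([-50.0, 50.0], rng)
```

In a wrapper setup such as the one described, a mask like this would pick the columns passed to the KNN classifier when a candidate solution is scored.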

Journal Article

Feature Selection in High-Dimensional Models via EBIC with Energy Distance Correlation

by Ocloo, Isaac Xoese; Chen, Hanfeng in Algorithms, Correlation coefficients, Energy

2022

In this paper, the LASSO method with extended Bayesian information criteria (EBIC) for feature selection in high-dimensional models is studied. We propose the use of the energy distance correlation in place of the ordinary correlation coefficient to measure the dependence of two variables. The energy distance correlation detects linear and non-linear association between two variables, unlike the ordinary correlation coefficient, which detects only linear association. EBIC is adopted as the stopping criterion. It is shown that the new method is more powerful than Luo and Chen’s method for feature selection. This is demonstrated by simulation studies and illustrated by a real-life example. It is also proved that the new algorithm is selection-consistent.

Journal Article

Mapping the Growing Stem Volume of the Coniferous Plantations in North China Using Multispectral Data from Integrated GF-2 and Sentinel-2 Images and an Optimized Feature Variable Selection Method

by Li, Xinyu; Long, Jiangping; Xu, Xiaodong in Accuracy, adaptive feature variable selection, Algorithms

2021

Accurate measurement of forest growing stem volume (GSV) is important for forest resource management and ecosystem dynamics monitoring. Optical remote sensing imagery has great application prospects in forest GSV estimation on regional and global scales as it is easily accessible, has wide coverage, and relies on mature technology. However, its application is limited by cloud coverage, data stripes, atmospheric effects, and satellite sensor errors. Combining multi-sensor data can reduce such limitations as it increases data availability, but it also raises the dimensionality of the data, which makes feature selection more difficult. In this study, GaoFen-2 (GF-2) and Sentinel-2 images were integrated, and feature variables and data scenarios were derived by a proposed adaptive feature variable combination optimization (AFCO) program for estimating the GSV of coniferous plantations. The AFCO algorithm was compared to four traditional feature variable selection methods, namely, random forest (RF), stepwise random forest (SRF), fast iterative feature selection method for k-nearest neighbors (KNN-FIFS), and the feature variable screening and combination optimization procedure based on the distance correlation coefficient and k-nearest neighbors (DC-FSCK). The comparison indicated that the AFCO program not only considered the combination effect of feature variables, but also optimized the selection of the first feature variable, error threshold, and selection of the estimation model. Furthermore, we selected feature variables from three datasets (GF-2, Sentinel-2, and the integrated data) following the AFCO and four other feature selection methods and used the k-nearest neighbors (KNN) and random forest regression (RFR) to estimate the GSV of coniferous plantations in northern China. 
The results indicated that the integrated data improved the GSV estimation accuracy of coniferous plantations, with relative root mean square errors (RMSErs) of 15.0% and 19.6%, which were lower than those of GF-2 and Sentinel-2 data, respectively. In particular, the texture feature variables derived from GF-2 red band image have a significant impact on GSV estimation performance of the integrated dataset. For most data scenarios, the AFCO algorithm gained more accurate GSV estimates, as the RMSErs were 30.0%, 23.7%, 17.7%, and 17.5% lower than those of RF, SRF, KNN-FIFS, and DC-FSCK, respectively. The GSV distribution map obtained by the AFCO method and RFR model matched the field observations well. This study provides some insight into the application of optical images, optimization of the feature variable combination, and modeling algorithm selection for estimating the GSV of coniferous plantations.

Journal Article

Extraction of Important Factors in a High-Dimensional Data Space: An Application for High-Growth Firms

by Misako Takayasu, Hideki Takayasu, Takuya Wada in Accuracy, Analysis, Astrophysics

2023

We introduce a new non-black-box method of extracting multiple areas in a high-dimensional big data space where data points that satisfy specific conditions are highly concentrated. First, we extract one-dimensional areas where the data that satisfy specific conditions are mostly gathered by using the Bayesian method. Second, we construct higher-dimensional areas where the densities of focused data points are higher than the simple combination of the results for one dimension, and then we verify the results through data validation. Third, we apply this method to estimate the set of significant factors shared in successful firms with growth rates in sales at the top 1% level using 156-dimensional data of corporate financial reports for 12 years containing about 320,000 firms. We also categorize high-growth firms into 15 groups of different sets of factors.

Journal Article

Identification of Peanut Kernels Infected with Multiple Aspergillus flavus Fungi Using Line-Scan Raman Hyperspectral Imaging

by Huang, Wenqian; Yang, Guang; An, Ting in Accuracy, Analytical Chemistry, Aspergillus flavus

2024

The mold contamination caused by Aspergillus flavus poses a serious threat to food safety. In this study, three artificially inoculated strains of Aspergillus flavus (A. flavus 142,801, A. flavus 142,803, A. flavus 336,156) were used to infect kernels of two healthy peanut varieties (variety A: GS1210, variety B: fengyingluohan). These healthy and Aspergillus flavus-infected peanut kernels were identified and differentiated by using a line-scan Raman hyperspectral imaging system. Firstly, the average spectra of healthy and infected peanuts were extracted, followed by preprocessing using Savitzky-Golay smoothing and airPLS for fluorescence background removal. Finally, four feature variable selection methods were used to optimize the models. In the binary classification model (healthy vs. A. flavus), the SVM method yielded the best modeling results, with accuracy above 99%. The best accuracy achieved in the three-classification model for mold on variety A peanut was 88.9%, and for variety B, it was 92.4%. In the model for mold on a mixture of both varieties, the highest accuracy reached was 74.8%. The results show that line-scan Raman hyperspectral imaging technology is practical in identifying healthy and Aspergillus flavus-infected peanut kernels. Moreover, this technique has great potential in identifying different Aspergillus flavus strains on a single peanut variety and provides a feasible method for fungal species identification.

Journal Article

The method based on ATR‐FTIR spectroscopy combined with feature variable selection for the boletus species and origins identification

by Wang, Yuanzhong; Liu, Honggao; Ji, Zhiyi in Accuracy, Algorithms, Basidiocarps

2024

Wild boletus mushrooms, which are macrofungi of the phylum Basidiomycetes, are a nutritious and unique natural food that is widely enjoyed. Because toxic and non‐toxic boletus species are hard to tell apart and the mushrooms can accumulate heavy metals, species identification and traceability are crucial in ensuring quality and safety of consumption. In this study, the attenuated total reflection Fourier transform infrared (ATR‐FTIR) spectroscopy technique was combined with three feature variable extraction methods (manual selection, semi‐manual selection, and algorithmic selection) to improve the accuracy and computational speed of model identification. Models were established for the identification of boletus species with an accuracy of up to 100% and for the identification of boletus origin with an accuracy of 86.36%. It was found that the best methods to improve the accuracy of the models were semi‐manual selection, manual selection and algorithmic selection in that order. This study can provide rapid and accurate species identification and origin traceability of wild boletus, and provide a theoretical basis for the rational use of feature variable selection methods. The species and origins of wild boletus were identified by FT‐MIR combined with LIBSVM models and three feature variable selection methods. The feature variable selection method should be chosen with care and the importance of a wider spectral range should be acknowledged.

Journal Article

Inversion of Coniferous Forest Stock Volume Based on Backscatter and InSAR Coherence Factors of Sentinel-1 Hyper-Temporal Images and Spectral Variables of Landsat 8 OLI

by Zheng, Huanna; Li, Xinyu; Long, Jiangping in Accuracy, Algorithms, Altitude

2022

Forest stock volume (FSV) is a basic data source for estimating forest carbon sink. It is also a crucial parameter that reflects the quality of forest resources and forest management level. Remote sensing data combined with a support vector regression (SVR) algorithm has been widely used in FSV estimation. However, due to the complexity and spatial heterogeneity of the forest biological community, in the FSV high-value area with dense vegetation, the optical remote sensing variables tend to be saturated, and the sensitivity of synthetic aperture radar (SAR) backscattering features to the FSV is significantly reduced. These factors seriously affect the accuracy of the FSV estimation. In this study, Landsat 8 (L8) Operational Land Imager multispectral images and C-band Sentinel-1 (S1) hyper-temporal SAR images were used to extract three remote sensing feature datasets: spectral variables (L8), backscattering coefficients (S1), and interferometric SAR factors (S1-InSAR). We proposed a feature selection method based on SVR (FS-SVR) and compared the FSV estimation performance of FS-SVR and stepwise regression analysis (SRA) on the aforementioned three remote sensing feature datasets. Finally, an estimation model of coniferous FSV was constructed using the SVR algorithm in Wangyedian Forest Farm, Inner Mongolia, China, and the spatial distribution map of coniferous FSV was predicted. The experimental results show the following: (1) The coherence amplitude and DSM data obtained based on S1 images contain information related to forest canopy height, and the hyper-temporal S1 image data significantly enrich the diversity of S1-InSAR feature factors. Therefore, the S1-InSAR dataset has a better FSV response than remote sensing factors such as the S1 backscattering coefficient and L8 vegetation index, and the corresponding root mean square error (RMSE) and relative RMSE (rRMSE) values reached 47.6 m³/ha and 20.9%, respectively. 
(2) The integrated dataset can give full play to the synergy of the L8, S1, and S1-InSAR remote sensing data. Its RMSE and rRMSE values are 44.3 m³/ha and 19.4%, respectively. (3) The proposed FS-SVR method can better select remote sensing variables suitable for FSV estimation than SRA. The average value of the rRMSE (23.17%) based on the three datasets was 13.8% lower than that of the SRA method (26.87%). This study provides new insights into FSV retrieval based on joint active and passive multisource remote sensing data.
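The "13.8% lower" figure in result (3) appears to be a relative reduction rather than a difference in percentage points (23.17% versus 26.87% differ by 3.70 points). The short check below reproduces it from the two averages quoted in the abstract; the helper name `relative_reduction` is introduced here only for illustration.

```python
def relative_reduction(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline * 100.0

# Average rRMSE values quoted in the abstract: SRA 26.87%, FS-SVR 23.17%.
reduction = relative_reduction(26.87, 23.17)
print(round(reduction, 1))  # 13.8
```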

Journal Article
