Catalogue Search | MBRL
111,370 result(s) for "CLUSTER ANALYSIS"
Defining clusters of related industries
by Stern, Scott; Delgado, Mercedes; Porter, Michael E.
in Algorithms, Cluster analysis, Input output analysis
2016
Clusters are geographic concentrations of industries related by knowledge, skills, inputs, demand and/or other linkages. There is an increasing need for cluster-based data to support research, facilitate comparisons of clusters across regions and support policymakers in defining regional strategies. This article develops a novel clustering algorithm that systematically generates and assesses sets of cluster definitions (i.e., groups of closely related industries). We implement the algorithm using 2009 data for U.S. industries (six-digit NAICS), and propose a new set of benchmark cluster definitions that incorporates measures of inter-industry linkages based on co-location patterns, input–output links, and similarities in labor occupations. We also illustrate the algorithm’s ability to compare alternative sets of cluster definitions by evaluating our new set against existing sets in the literature. We find that our proposed set outperforms other methods in capturing a wide range of inter-industry linkages, including the grouping of industries within the same three-digit NAICS.
Journal Article
A Practitioner's Guide to Cluster-Robust Inference
2015
We consider statistical inference for regression when data are grouped into clusters, with regression model errors independent across clusters but correlated within clusters. Examples include data on individuals with clustering on village or region or other category such as industry, and state-year differences-in-differences studies with clustering on state. In such settings, default standard errors can greatly overstate estimator precision. Instead, if the number of clusters is large, statistical inference after OLS should be based on cluster-robust standard errors. We outline the basic method as well as many complications that can arise in practice. These include cluster-specific fixed effects, few clusters, multiway clustering, and estimators other than OLS.
Journal Article
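As a rough illustration of the cluster-robust standard errors described in the abstract above, the sketch below compares a default standard error with a cluster-robust one for OLS with a single regressor and no intercept. It is not taken from the article: the function names, the toy data, and the omission of any small-sample correction are assumptions made for brevity.

```python
from collections import defaultdict
from math import sqrt


def ols_slope(x, y):
    """OLS slope for y = b * x + e (no intercept; a simplifying assumption)."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)


def default_se(x, y):
    """Conventional standard error that treats all errors as independent."""
    b = ols_slope(x, y)
    resid = [yi - b * xi for xi, yi in zip(x, y)]
    sigma2 = sum(e * e for e in resid) / (len(x) - 1)
    return sqrt(sigma2 / sum(xi * xi for xi in x))


def cluster_robust_se(x, y, cluster):
    """Sandwich standard error: sum x*e within each cluster, then square.

    No finite-sample correction is applied (an assumption of this sketch).
    """
    b = ols_slope(x, y)
    scores = defaultdict(float)
    for xi, yi, g in zip(x, y, cluster):
        scores[g] += xi * (yi - b * xi)
    meat = sum(s * s for s in scores.values())
    return sqrt(meat) / sum(xi * xi for xi in x)


# Hypothetical toy data: errors are identical within each cluster.
x = [1, 1, 1, -1, -1, -1, 2, 2, 2, -2, -2, -2]
cluster = ["a"] * 3 + ["b"] * 3 + ["c"] * 3 + ["d"] * 3
shock = {"a": 1, "b": 1, "c": -1, "d": -1}
y = [xi + shock[g] for xi, g in zip(x, cluster)]

print(default_se(x, y), cluster_robust_se(x, y, cluster))
```

On this toy data the cluster-robust standard error is larger than the default one, which is the pattern the abstract warns about when errors are correlated within clusters.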
WILD BOOTSTRAP INFERENCE FOR WILDLY DIFFERENT CLUSTER SIZES
2017
The cluster robust variance estimator (CRVE) relies on the number of clusters being sufficiently large. Monte Carlo evidence suggests that the ‘rule of 42’ is not true for unbalanced clusters. Rejection frequencies are higher for datasets with 50 clusters proportional to US state populations than with 50 balanced clusters. Using critical values based on the wild cluster bootstrap performs much better. However, this procedure fails when a small number of clusters is treated. We explain why CRVE t statistics and the wild bootstrap fail in this case, study the ‘effective number’ of clusters and simulate placebo laws with dummy variable regressors.
Journal Article
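The wild cluster bootstrap mentioned in the abstract above can be sketched for a very small problem. The example below is an assumption-laden illustration, not the authors' procedure: it uses a single regressor without intercept, imposes the null hypothesis that the slope is zero, and enumerates every Rademacher sign pattern instead of drawing them at random, which is only feasible with a handful of clusters. All function names and data are hypothetical.

```python
from collections import defaultdict
from itertools import product
from math import sqrt


def cluster_t_stat(x, y, cluster):
    """t statistic for slope = 0 using a cluster-robust standard error."""
    sxx = sum(xi * xi for xi in x)
    b = sum(xi * yi for xi, yi in zip(x, y)) / sxx
    scores = defaultdict(float)
    for xi, yi, g in zip(x, y, cluster):
        scores[g] += xi * (yi - b * xi)
    se = sqrt(sum(s * s for s in scores.values())) / sxx
    return b / se


def wild_cluster_bootstrap_p(x, y, cluster):
    """Symmetric bootstrap p-value with the null (slope = 0) imposed.

    Under the null the restricted residuals are simply y, so each
    bootstrap sample flips the sign of y for whole clusters at once.
    """
    t_obs = abs(cluster_t_stat(x, y, cluster))
    groups = sorted(set(cluster))
    hits = 0
    total = 0
    for signs in product((-1.0, 1.0), repeat=len(groups)):
        weight = dict(zip(groups, signs))
        y_star = [weight[g] * yi for yi, g in zip(y, cluster)]
        t_star = abs(cluster_t_stat(x, y_star, cluster))
        total += 1
        if t_star >= t_obs - 1e-12:
            hits += 1
    return hits / total


# Hypothetical toy data with four clusters of two observations each.
x = [1, 2, 1, 3, 2, 2, 1, 1]
y = [1, 3, 0, 2, 5, 3, -1, 1]
cluster = ["a", "a", "b", "b", "c", "c", "d", "d"]

print(wild_cluster_bootstrap_p(x, y, cluster))
```

With G clusters there are only 2^G sign patterns, so the p-value can never fall below 2 / 2^G. This is one simple way to see why inference becomes unreliable when very few clusters are available.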
A promoter-level mammalian expression atlas
by Jørgensen, Mette; Plessy, Charles; Chierici, Marco
in 631/114/2114, 631/208/200, 631/337/2019
2014
Regulated transcription controls the diversity, developmental pathways and spatial organization of the hundreds of cell types that make up a mammal. Using single-molecule cDNA sequencing, we mapped transcription start sites (TSSs) and their usage in human and mouse primary cells, cell lines and tissues to produce a comprehensive overview of mammalian gene expression across the human body. We find that few genes are truly ‘housekeeping’, whereas many mammalian promoters are composite entities composed of several closely separated TSSs, with independent cell-type-specific expression profiles. TSSs specific to different cell types evolve at different rates, whereas promoters of broadly expressed genes are the most conserved. Promoter-based expression analysis reveals key transcription factors defining cell states and links them to binding-site motifs. The functions of identified novel transcripts can be predicted by coexpression and sample ontology enrichment analyses. The functional annotation of the mammalian genome 5 (FANTOM5) project provides comprehensive expression profiles and functional annotation of mammalian cell-type-specific transcriptomes with wide applications in biomedical research.
A study from the FANTOM consortium using single-molecule cDNA sequencing of transcription start sites and their usage in human and mouse primary cells, cell lines and tissues reveals insights into the specificity and diversity of transcription patterns across different mammalian cell types.
Mapping the human transcription
FANTOM5 (standing for functional annotation of the mammalian genome 5) is the fifth major stage of a major international collaboration that aims to dissect the transcriptional regulatory networks that define every human cell type. Two Articles in this issue of Nature present some of the project's latest results. The first paper uses the FANTOM5 panel of tissue and primary cell samples to define an atlas of active, in vivo bidirectionally transcribed enhancers across the human body. These authors show that bidirectional capped RNAs are a signature feature of active enhancers and identify more than 40,000 enhancer candidates from over 800 human cell and tissue samples. The enhancer atlas is used to compare regulatory programs between different cell types and identify disease-associated regulatory SNPs, and will be a resource for studies on cell-type-specific enhancers. In the second paper, single-molecule sequencing is used to map human and mouse transcription start sites and their usage in a panel of distinct human and mouse primary cells, cell lines and tissues to produce the most comprehensive mammalian gene expression atlas to date. The data provide a plethora of insights into open reading frames and promoters across different cell types in addition to valuable annotation of mammalian cell-type-specific transcriptomes.
Journal Article
Cluster randomised trials with a binary outcome and a small number of clusters: comparison of individual and cluster level analysis method
by Hayes, Richard J.; Thompson, Jennifer A.; Fielding, Katherine L.
in Bias, Binomial distribution, Clinical trials
2022
Background
Cluster randomised trials (CRTs) are often designed with a small number of clusters, but it is not clear which analysis methods are optimal when the outcome is binary. This simulation study aimed to determine (i) whether cluster-level analysis (CL), generalised linear mixed models (GLMM), and generalised estimating equations with sandwich variance (GEE) approaches maintain acceptable type-one error including the impact of non-normality of cluster effects and low prevalence, and if so (ii) which methods have the greatest power. We simulated CRTs with 8–30 clusters, altering the cluster-size, outcome prevalence, intracluster correlation coefficient, and cluster effect distribution. We analysed each dataset with weighted and unweighted CL; GLMM with adaptive quadrature and restricted pseudolikelihood; GEE with Kauermann-and-Carroll and Fay-and-Graubard sandwich variance using independent and exchangeable working correlation matrices. P-values were from a t-distribution with degrees of freedom (DoF) as clusters minus cluster-level parameters; GLMM pseudolikelihood also used Satterthwaite and Kenward-Roger DoF.
Results
Unweighted CL, GLMM pseudolikelihood, and Fay-and-Graubard GEE with independent or exchangeable working correlation matrix controlled type-one error in > 97% of scenarios with clusters minus parameters DoF. Cluster-effect distribution and prevalence of outcome did not usually affect analysis method performance. GEE had the least power. With 20–30 clusters, GLMM had greater power than CL with varying cluster-size but similar power otherwise; with fewer clusters, GLMM had lower power with common cluster-size, similar power with medium variation, and greater power with large variation in cluster-size.
Conclusion
We recommend that CRTs with ≤ 30 clusters and a binary outcome use an unweighted CL or restricted pseudolikelihood GLMM both with DoF clusters minus cluster-level parameters.
Journal Article
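The unweighted cluster-level analysis recommended in the abstract above can be sketched as a two-sample t-test on cluster proportions, with degrees of freedom equal to the number of clusters minus two cluster-level parameters (intercept and treatment effect). The function name, the input layout, and the toy counts are assumptions for illustration. The sketch stops at the t statistic because the standard library has no t-distribution CDF.

```python
from math import sqrt
from statistics import mean, variance


def cluster_level_t(arm_a, arm_b):
    """Unweighted cluster-level comparison of two trial arms.

    Each arm is a list of (events, cluster_size) tuples. Every cluster
    contributes one proportion with equal weight, whatever its size.
    Returns (risk difference, t statistic, degrees of freedom).
    """
    prop_a = [events / size for events, size in arm_a]
    prop_b = [events / size for events, size in arm_b]
    diff = mean(prop_a) - mean(prop_b)
    dof = len(prop_a) + len(prop_b) - 2
    pooled_var = (
        (len(prop_a) - 1) * variance(prop_a)
        + (len(prop_b) - 1) * variance(prop_b)
    ) / dof
    se = sqrt(pooled_var * (1 / len(prop_a) + 1 / len(prop_b)))
    return diff, diff / se, dof


# Hypothetical trial with four clusters per arm.
intervention = [(5, 50), (10, 50), (15, 50), (10, 50)]
control = [(20, 50), (15, 50), (25, 50), (20, 50)]

diff, t_stat, dof = cluster_level_t(intervention, control)
print(diff, t_stat, dof)
```

The t statistic would then be compared against a t-distribution with `dof` degrees of freedom, for example using a statistics package of the reader's choice.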
Dynamic Tensor Clustering
2019
Dynamic tensor data are becoming prevalent in numerous applications. Existing tensor clustering methods either fail to account for the dynamic nature of the data, or are inapplicable to a general-order tensor. There is also a gap between statistical guarantee and computational efficiency for existing tensor clustering solutions. In this article, we propose a new dynamic tensor clustering method that works for a general-order dynamic tensor, and enjoys both strong statistical guarantee and high computational efficiency. Our proposal is based on a new structured tensor factorization that encourages both sparsity and smoothness in parameters along the specified tensor modes. Computationally, we develop a highly efficient optimization algorithm that benefits from substantial dimension reduction. Theoretically, we first establish a nonasymptotic error bound for the estimator from the structured tensor factorization. Built upon this error bound, we then derive the rate of convergence of the estimated cluster centers, and show that the estimated clusters recover the true cluster structures with high probability. Moreover, our proposed method can be naturally extended to co-clustering of multiple modes of the tensor data. The efficacy of our method is illustrated through simulations and a brain dynamic functional connectivity analysis from an autism spectrum disorder study.
Supplementary materials for this article are available online.
Journal Article
In simulated data and health records, latent class analysis was the optimum multimorbidity clustering algorithm
by Barrett, Jessica; Griffin, Simon; Yau, Christopher
in Algorithms, Cluster Analysis, Clustering
2022
To investigate the reproducibility and validity of latent class analysis (LCA) and hierarchical cluster analysis (HCA), multiple correspondence analysis followed by k-means (MCA-kmeans) and k-means (kmeans) for multimorbidity clustering.
We first investigated clustering algorithms in simulated datasets with 26 diseases of varying prevalence in predetermined clusters, comparing the derived clusters to known clusters using the adjusted Rand Index (aRI). We then investigated the medical records of male patients, aged 65 to 84 years from 50 UK general practices, with 49 long-term health conditions. We compared within-cluster morbidity profiles using the Pearson correlation coefficient and assessed cluster stability using 400 bootstrap samples.
In the simulated datasets, the closest agreement (largest aRI) to known clusters was with LCA and then MCA-kmeans algorithms. In the medical records dataset, all four algorithms identified one cluster of 20–25% of the dataset with about 82% of the same patients across all four algorithms. LCA and MCA-kmeans both found a second cluster of 7% of the dataset. Other clusters were found by only one algorithm. LCA and MCA-kmeans clustering gave the most similar partitioning (aRI 0.54).
LCA achieved higher aRI than other clustering algorithms.
Journal Article
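The adjusted Rand Index (aRI) used in the abstract above to compare derived clusters with known clusters can be computed from a contingency table of the two labelings. The sketch below is a minimal pure-Python version of the standard formula. The function name and the example labels are hypothetical and are not drawn from the study.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two partitions of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    n = len(labels_a)
    if n < 2:
        return 1.0  # nothing to disagree about (convention for this sketch)

    # Pair counts within the contingency cells and within the marginals.
    cells = Counter(zip(labels_a, labels_b))
    index = sum(comb(c, 2) for c in cells.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions are one cluster
    return (index - expected) / (max_index - expected)


# Hypothetical labelings of four items.
known = [0, 0, 1, 1]
derived = [0, 0, 1, 2]
print(adjusted_rand_index(known, derived))
```

For these labels the index is 4/7 (roughly 0.57). Identical partitions give 1.0 regardless of how the clusters are named, and values near 0 indicate agreement no better than chance.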
Integrating water quality index, GIS and multivariate statistical techniques towards a better understanding of drinking water quality
by Khan, Warish; Masood, Sarfaraz; Aslam, Mohammad
in Agricultural practices, Alkalinity, Anions
2022
Groundwater is an essential component of the accessible water resources across the world. Due to urbanization, industrialization and intensive farming practices, groundwater resources have been exposed to large-scale depletion and quality degradation. The prime objective of this study was to evaluate the groundwater quality for drinking purposes in the Mewat district of Haryana, India. For this purpose, twenty-five groundwater samples were collected from hand pumps and tube wells spread over the entire district. Samples were analyzed in the laboratory for pH, electrical conductivity (EC), total dissolved solids (TDS), total hardness (TH), turbidity, total alkalinity (TA), cations and anions using standard methods. Two different water quality indices (weighted arithmetic water quality index and entropy weighted water quality index) were computed to characterize the groundwater quality of the study area. The ordinary kriging technique was applied to generate spatial distribution maps of the WQIs. Four semivariogram models (circular, spherical, exponential and Gaussian) were fitted to analyze the spatial variability of the weighted arithmetic index (GWQI) and the entropy weighted water quality index (EWQI). Hierarchical cluster analysis (HCA), principal component analysis (PCA) and discriminant analysis (DA) were applied to provide additional scientific insights into the information content of the groundwater quality data available for this study. The interpretation of the WQI analysis based on GWQI and EWQI reveals that 64% of the samples belong to the “poor” to “very poor” bracket. The semivariogram modeling shows that the Gaussian model gives the best fit for both the EWQI and GWQI datasets. HCA classified the 25 sampling locations into three main clusters of similar groundwater characteristics. DA validated these clusters and identified a total of three significant variables (pH, EC and Cl) using the stepwise method. The application of PCA resulted in three factors explaining 69.81% of the total variance. These factors reveal how processes like rock-water interaction, urban waste discharge and mineral dissolution affect the groundwater quality.
Journal Article