Catalogue Search | MBRL
36 result(s) for "Zaretzki, Russell"
World of code: enabling a research workflow for mining and analyzing the universe of open source VCS data
by Valiev, Marat; Ma, Yuxing; Kennard, David
in Control data (computers); Evaluation; Open source software
2021
Open source software (OSS) is essential for modern society and, while substantial research has been done on individual (typically central) projects, only a limited understanding of the periphery of the entire OSS ecosystem exists. For example, how are the tens of millions of projects in the periphery interconnected through technical dependencies, code sharing, or knowledge flow? To answer such questions we: a) create a very large and frequently updated collection of version control data in the entire FLOSS ecosystems named World of Code (WoC), that can completely cross-reference authors, projects, commits, blobs, dependencies, and history of the FLOSS ecosystems and b) provide capabilities to efficiently correct, augment, query, and analyze that data. Our current WoC implementation is capable of being updated on a monthly basis and contains over 18B Git objects. To evaluate its research potential and to create vignettes for its usage, we employ WoC in conducting several research tasks. In particular, we find that it is capable of supporting trend evaluation, ecosystem measurement, and the determination of package usage. We expect WoC to spur investigation into global properties of OSS development leading to increased resiliency of the entire OSS ecosystem. Our infrastructure facilitates the discovery of key technical dependencies, code flow, and social networks that provide the basis to determine the structure and evolution of the relationships that drive FLOSS activities and innovation.
Journal Article
ALFAA: Active Learning Fingerprint based Anti-Aliasing for correcting developer identity errors in version control systems
by Amreen, Sadika; Bogart, Christopher; Zaretzki, Russell
in Active learning; Algorithms; Aliasing
2020
An accurate determination of developer identities is important for software engineering research and practice. Without it, even simple questions such as “how many developers does a project have?” cannot be answered. The commonly used version control data from Git is full of identity errors and the existing approaches to correct these errors are difficult to validate on large scale and cannot be easily improved. We, therefore, aim to develop a scalable, highly accurate, easy to use and easy to improve approach to correct software developer identity errors. We first amalgamate developer identities from version control systems in open source software repositories and investigate the nature and prevalence of these errors, design corrective algorithms, and estimate the impact of the errors on networks inferred from this data. We investigate these questions using a collection of over 1B Git commits with over 23M recorded author identities. By inspecting the author strings that occur most frequently, we group identity errors into categories. We then augment the author strings with three behavioral fingerprints: time-zone frequencies, the set of files modified, and a vector embedding of the commit messages. We create a manually validated set of identities for a subset of OpenStack developers using an active learning approach and use it to fit supervised learning models to predict the identities for the remaining author strings in OpenStack. We then compare these predictions with a competing commercially available effort and a leading research method. Finally, we compare network measures for file-induced author networks based on corrected and raw data. We find commits done from different environments, misspellings, organizational ids, default values, and anonymous IDs to be the major sources of errors. 
We also find supervised learning methods to reduce errors by several times in comparison to existing research and commercial methods and the active learning approach to be an effective way to create validated datasets. Results also indicate that correction of developer identity has a large impact on the inference of the social network. We believe that our proposed Active Learning Fingerprint Based Anti-Aliasing (ALFAA) approach will expedite research progress in the software engineering domain for applications that involve developer identities.
Journal Article
State-Level Geographic Disparities and Temporal Patterns in Milk Somatic Cell Counts Across the United States, 2011–2023
by Zaretzki, Russell; Vidlund, Jessica; Odoi, Agricola
in Agriculture; Clustering; Dairy Herd Improvement Association (DHIA)
2025
The dairy industry faces challenges from mastitis, which affects milk quality. Somatic Cell Counts (SCCs) are key indicators of udder health, subclinical mastitis presence, and legal thresholds. However, limited research has explored geographic disparities and temporal patterns in SCCs across the USA, despite their critical role in informing targeted herd management strategies, optimizing policy interventions, and ensuring consistent milk quality standards. This study aimed to examine temporal trends and geographic disparities in median weighted SCCs (mwSCCs) across USA states. This study analyzes SCC data using records from the Dairy Herd Improvement Association across 42–45 states between 2011 and 2023. State-level differences in mwSCCs were examined, with temporal changes assessed using percent differences between 2011 and 2023. Moran’s I and Local Indicators of Spatial Association (LISA) were used to identify spatial clusters of states with high and low mwSCCs. The mwSCCs decreased by 24.8%, from 234,000 cells/mL in 2011 to 176,000 cells/mL in 2023, with significant reductions in Virginia and Georgia, while Tennessee and South Carolina had minimal declines. However, Texas, California, and Colorado saw increases, with Colorado rising by 147.9%. Spatial clustering revealed high mwSCCs in the southeast and low levels in the northeast, highlighting the need for region-specific strategies.
Journal Article
Unlocking a signal of introgression from codons in Lachancea kluyveri using a mutation-selection model
by Landerer, Cedric; Zaretzki, Russell; Gilchrist, Michael A
in Amino acids; Bayesian analysis; Bias
2020
Background
For decades, codon usage has been used as a measure of adaptation for translational efficiency and translation accuracy of a gene’s coding sequence. These patterns of codon usage reflect both the selective and mutational environment in which the coding sequences evolved. Over this same period, gene transfer between lineages has become widely recognized as an important biological phenomenon. Nevertheless, most studies of codon usage implicitly assume that all genes within a genome evolved under the same selective and mutational environment, an assumption violated when introgression occurs. In order to better understand the effects of introgression on codon usage patterns and vice versa, we examine the patterns of codon usage in Lachancea kluyveri, a yeast which has experienced a large introgression. We quantify the effects of mutation bias and selection for translation efficiency on the codon usage pattern of the endogenous and introgressed exogenous genes using a Bayesian mixture model, ROC SEMPPR, which is built on mechanistic assumptions about protein synthesis and grounded in population genetics.
Results
We find substantial differences in codon usage between the endogenous and exogenous genes, and show that these differences can be largely attributed to differences in mutation bias favoring A/T ending codons in the endogenous genes while favoring C/G ending codons in the exogenous genes. Recognizing the two different signatures of mutation bias and selection improves our ability to predict protein synthesis rate by 42% and allowed us to accurately assess the decaying signal of endogenous codon mutation and preferences. In addition, using our estimates of mutation bias and selection, we identify Eremothecium gossypii as the closest relative to the exogenous genes, providing an alternative hypothesis about the origin of the exogenous genes, estimate that the introgression occurred ∼6×10^8 generations ago, and estimate its historic and current selection against mismatched codon usage.
Conclusions
Our work illustrates how mechanistic, population genetic models like ROC SEMPPR can separate the effects of mutation and selection on codon usage and provide quantitative estimates from sequence data.
Journal Article
Oral microbiome and mycobiome dynamics in cancer therapy-induced oral mucositis
by Panella, Timothy J.; Shope, Grace A.; Nodit, Laurentia
in 631/67/1665; 692/699/67/1665; Bacteria
2025
Cancer therapy-induced oral mucositis is a frequent major oncological problem, secondary to cytotoxicity of chemo-radiation treatment. Oral mucositis commonly occurs 7–10 days after initiation of therapy; it is a dose-limiting side effect causing significant pain, eating difficulty, need for parenteral nutrition and a rise of infections. The pathobiology derives from complex interactions between the epithelial component, inflammation, and the oral microbiome. Our longitudinal study analysed the dynamics of the oral microbiome (bacteria and fungi) in nineteen patients undergoing chemo-radiation therapy for oral and oropharyngeal squamous cell carcinoma as compared to healthy volunteers. The microbiome was characterized in multiple oral sample types using rRNA and ITS sequence amplicons and followed the treatment regimens. Microbial taxonomic diversity and relative abundance may be correlated with disease state, type of treatment and responses. Identification of microbial-host interactions could lead to further therapeutic interventions of mucositis to re-establish normal flora and promote patients’ health. Data presented here could enhance, complement and diversify other studies that link microbiomes to oral disease, prophylactics, treatments, and outcome.
Journal Article
Predicting and Correlating the Strength Properties of Wood Composite Process Parameters by Use of Boosted Regression Tree Models
2015
Predictive boosted regression tree (BRT) models were developed to predict modulus of rupture (MOR) and internal bond (IB) for a US particleboard manufacturer. The temporal process data consisted of 4,307 records and spanned the time frame from March 2009 to June 2010. This study builds on previous published research by developing BRT models across all product types of MOR and IB produced by the particleboard manufacturer. A total of 189 continuous variables from the process line were used as possible predictor variables. BRT model comparisons were made using the root mean squared error for prediction (RMSEP) and the RMSEP relative to the mean of the response variable as a percent (RMSEP%) for the validation data sets. For MOR, RMSEP values ranged from 1.051 to 1.443 MPa, and RMSEP% values ranged from 8.5 to 11.6 percent. For IB, RMSEP values ranged from 0.074 to 0.108 MPa, and RMSEP% values ranged from 12.7 to 18.6 percent. BRT models for MOR and IB predicted better than respective regression tree models without boosting. For MOR, key predictors in the BRT models were related to “pressing temperature zones,” “thickness of pressing,” and “pressing pressure.” For IB, key predictors in the BRT models were related to “thickness of pressing.” The BRT predictive models offer manufacturers an opportunity to improve the understanding of processes and be more predictive in the outcomes of product quality attributes. This may help manufacturers reduce rework and scrap and also improve production efficiencies by avoiding unnecessarily high operating targets.
Journal Article
Skill Plot: A Graphical Technique for Evaluating Continuous Diagnostic Tests
by Briggs, William M.; Zaretzki, Russell
in Algorithms; Decision making; Decision Support Systems, Clinical
2008
We introduce the Skill Plot, a method that is directly relevant to a decision maker who must use a diagnostic test. In contrast to ROC curves, the skill curve allows easy graphical inspection of the optimal cutoff or decision rule for a diagnostic test. The skill curve and test also determine whether diagnoses based on this cutoff improve upon a naive forecast (of always present or of always absent). The skill measure makes it easy to directly compare the predictive utility of two different classifiers in an analogy to the area under the curve statistic related to ROC analysis. Finally, this article shows that the skill-based cutoff inferred from the plot is equivalent to the cutoff indicated by optimizing the posterior odds in accordance with Bayesian decision theory. A method for constructing a confidence interval for this optimal point is presented and briefly discussed.
Journal Article
Bias correction and Bayesian analysis of aggregate counts in SAGE libraries
by Briggs, William M; Gilchrist, Michael A; Armagan, Artin
in Algorithms; Analysis; Bayes Theorem
2010
Background
Tag-based techniques, such as SAGE, are commonly used to sample the mRNA pool of an organism's transcriptome. Incomplete digestion during the tag formation process may allow for multiple tags to be generated from a given mRNA transcript. The probability of forming a tag varies with its relative location. As a result, the observed tag counts represent a biased sample of the actual transcript pool. In SAGE this bias can be avoided by ignoring all but the 3'-most tag, but doing so discards a large fraction of the observed data. Taking this bias into account should allow more of the available data to be used, leading to increased statistical power.
Results
Three new hierarchical models, which directly embed a model for the variation in tag formation probability, are proposed and their associated Bayesian inference algorithms are developed. These models may be applied to libraries at both the tag and aggregate level. Simulation experiments and analysis of real data are used to contrast the accuracy of the various methods. The consequences of tag formation bias are discussed in the context of testing differential expression. A description is given as to how these algorithms can be applied in that context.
Conclusions
Several Bayesian inference algorithms that account for tag formation effects are compared with the DPB algorithm providing clear evidence of superior performance. The accuracy of inferences when using a particular non-informative prior is found to depend on the expression level of a given gene. The multivariate nature of the approach easily allows both univariate and joint tests of differential expression. Calculations demonstrate the potential for false positive and negative findings due to variation in tag formation probabilities across samples when testing for differential expression.
Journal Article
Model selection via adaptive shrinkage with t priors
by Zaretzki, Russell L.; Armagan, Artin
in Accuracy; Algorithms; Economic Theory/Quantitative Economics/Mathematical Methods
2010
We discuss a model selection procedure, the adaptive ridge selector, derived from a hierarchical Bayes argument, which results in a simple and efficient fitting algorithm. The hierarchical model utilized resembles an un-replicated variance components model and leads to weighting of the covariates. We discuss the intuition behind this type of estimator and investigate its behavior as a regularized least squares procedure. While related alternatives were recently exploited to simultaneously fit and select variables/features in regression models (Tipping in J Mach Learn Res 1:211–244, 2001; Figueiredo in IEEE Trans Pattern Anal Mach Intell 25:1150–1159, 2003), the extension presented here shows considerable improvement in model selection accuracy in several important cases. We also compare this estimator’s model selection performance to those offered by the lasso and adaptive lasso solution paths. Under randomized experimentation, we show that a fixed choice of tuning parameter leads to results in terms of model selection accuracy which are superior to the entire solution paths of lasso and adaptive lasso when the underlying model is a sparse one. We provide a robust version of the algorithm which is suitable in cases where outliers may exist.
Journal Article
Competing Against the Unknown: The Impact of Enabling and Constraining Institutions on the Informal Economy
by Crook, T. Russell; Zaretzki, Russell; Mathias, B. D.
in Business and Management; Business Ethics; Competition
2015
In addition to facing the known competitors in the formal economy, entrepreneurs must also be concerned with rivalry emanating from the informal economy. The informal economy is characterized by actions outside the normal scope of commerce, such as unsanctioned payments and gift-giving, as means of influencing competition. Scholars and policy makers alike have an interest in mitigating the impacts of such informal activity in that it might present an obstacle for legitimate commerce. Received theory suggests that country institutions can enable and constrain productive activity, and, in doing so, influence competitive obstacles in a country. Leveraging 13,670 responses from entrepreneurs distributed across 59 countries, we provide evidence that two particular types of enabling institutions, countries' property rights regulations and cooperative actions, are useful for lowering the obstacles presented by informal activity. We also find evidence that two constraining institutions, economic and financial regulations, lead to more obstacles presented by informal activity. We describe implications for entrepreneurs, policy makers, and future researchers stemming from these findings.
Journal Article