Catalogue Search | MBRL

The impact of automated feature selection techniques on the interpretation of defect models

by Tantithamthavorn Chakkrit , Treude Christoph , Jirayus, Jiarpakdee in Automation , Correlation , Defects

2020

The interpretation of defect models heavily relies on software metrics that are used to construct them. Prior work often uses feature selection techniques to remove metrics that are correlated and irrelevant in order to improve model performance. Yet, conclusions that are derived from defect models may be inconsistent if the selected metrics are inconsistent and correlated. In this paper, we systematically investigate 12 automated feature selection techniques with respect to the consistency, correlation, performance, computational cost, and the impact on the interpretation dimensions. Through an empirical investigation of 14 publicly-available defect datasets, we find that (1) 94–100% of the selected metrics are inconsistent among the studied techniques; (2) 37–90% of the selected metrics are inconsistent among training samples; (3) 0–68% of the selected metrics are inconsistent when the feature selection techniques are applied repeatedly; (4) 5–100% of the produced subsets of metrics contain highly correlated metrics; and (5) while the most important metrics are inconsistent among correlation threshold values, such inconsistent most important metrics are highly-correlated with the Spearman correlation of 0.85–1. Since we find that the subsets of metrics produced by the commonly-used feature selection techniques (except for AutoSpearman) are often inconsistent and correlated, these techniques should be avoided when interpreting defect models. In addition to introducing AutoSpearman which mitigates correlated metrics better than commonly-used feature selection techniques, this paper opens up new research avenues in the automated selection of features for defect models to optimise for interpretability as well as performance.

Journal Article

GitHub Discussions: An exploratory study of early adoption

by Treude Christoph , Hata Hideaki , Baltes, Sebastian in Collaboration , Communication channels , Open source software

2022

Discussions is a new feature of GitHub for asking questions or discussing topics outside of specific Issues or Pull Requests. Before being available to all projects in December 2020, it had been tested on selected open source software projects. To understand how developers use this novel feature, how they perceive it, and how it impacts the development processes, we conducted a mixed-methods study based on early adopters of GitHub Discussions from January until July 2020. We found that: (1) errors, unexpected behavior, and code reviews are prevalent discussion categories; (2) there is a positive relationship between project member involvement and discussion frequency; (3) developers consider GitHub Discussions useful but face the problem of topic duplication between Discussions and Issues; (4) Discussions play a crucial role in advancing the development of projects; and (5) positive sentiment in Discussions is more frequent than in Stack Overflow posts. Our findings are a first step towards data-informed guidance for using GitHub Discussions, opening up avenues for future work on this novel communication channel.

Journal Article

Wait for it: identifying “On-Hold” self-admitted technical debt

by Rungroj, Maipradit , Treude Christoph , Hata Hideaki in Automation , Software development , Software engineering

2020

Self-admitted technical debt refers to situations where a software developer knows that their current implementation is not optimal and indicates this using a source code comment. In this work, we hypothesize that it is possible to develop automated techniques to understand a subset of these comments in more detail, and to propose tool support that can help developers manage self-admitted technical debt more effectively. Based on a qualitative study of 333 comments indicating self-admitted technical debt, we first identify one particular class of debt amenable to automated management: on-hold self-admitted technical debt (on-hold SATD), i.e., debt which contains a condition to indicate that a developer is waiting for a certain event or an updated functionality having been implemented elsewhere. We then design and evaluate an automated classifier which can identify these on-hold instances with an area under the receiver operating characteristic curve (AUC) of 0.98 as well as detect the specific conditions that developers are waiting for. Our work presents a first step towards automated tool support that is able to indicate when certain instances of self-admitted technical debt are ready to be addressed.

Journal Article

An empirical study of developers’ discussions about security challenges of different programming languages

by Xie Yongzheng , Treude Christoph , Zahedi Mansooreh in Data sources , Empirical analysis , Programming languages

2022

Given programming languages can provide different types and levels of security support, it is critically important to consider security aspects while selecting programming languages for developing software systems. Inadequate consideration of security in the choice of a programming language may lead to potential ramifications for secure development. Whilst theoretical analysis of the supposed security properties of different programming languages has been conducted, there has been relatively little effort to empirically explore the actual security challenges experienced by developers. We have performed a large-scale study of the security challenges of 15 programming languages by quantitatively and qualitatively analysing the developers’ discussions from Stack Overflow and GitHub. By leveraging topic modelling, we have derived a taxonomy of 18 major security challenges for 6 topic categories. We have also conducted comparative analysis to understand how the identified challenges vary regarding the different programming languages and data sources. Our findings suggest that the challenges and their characteristics differ substantially for different programming languages and data sources, i.e., Stack Overflow and GitHub. The findings provide evidence-based insights and understanding of security challenges related to different programming languages to software professionals (i.e., practitioners or researchers). The reported taxonomy of security challenges can assist both practitioners and researchers in better understanding and traversing the secure development landscape. This study highlights the importance of the choice of technology, e.g., programming language, in secure software engineering. Hence, the findings are expected to motivate practitioners to consider the potential impact of the choice of programming languages on software security.

Journal Article

Developer reactions to protestware in open source software: the cases of color.js and es5.ext

by Wang, Dong , Treude, Christoph , Hata, Hideaki in Compilers , Computer Science , Interpreters

2025

There is growing concern about maintainers self-sabotaging their work in order to take political or economic stances, a practice referred to as “protestware”. Our objective is to understand the discourse around discussions on such an attack, how it is received by the community, and whether developers respond to the attack in a timely manner. We study two notable protestware cases, i.e., colors.js and es5-ext. Results indicate that protestware discussions spread more quickly on the GitHub platform, while discussions of security vulnerabilities spread faster on social media. By establishing a taxonomy of protestware discussions, we identify posts that express stances and provide technical mitigation instructions. We applied a thematic analysis to 684 protestware-related posts to identify five major themes during the discussions: i. disseminate and response, ii. stance, iii. reputation, iv. communicative styles, v. rights and ethics. This work sheds light on the nuanced landscape of protestware discussions, offering insights for both researchers and developers into maintaining a healthy balance between the political or social actions of developers and the collective well-being of the open-source community.

Journal Article

Correction to: Wait for it: identifying “On-Hold” self-admitted technical debt

by Rungroj, Maipradit , Treude Christoph , Hata Hideaki

2021

A Correction to this paper has been published: https://doi.org/10.1007/s10664-021-09939-7

Journal Article

18 million links in commit messages: purpose, evolution, and decay

by Treude, Christoph , Hata, Hideaki , Xiao, Tao in Case studies , Data collection , Decay

2023

Commit messages contain diverse and valuable types of knowledge in all aspects of software maintenance and evolution. Links are an example of such knowledge. Previous work on “9.6 million links in source code comments” showed that links are prone to decay, become outdated, and lack bidirectional traceability. We conducted a large-scale study of 18,201,165 links from commits in 23,110 GitHub repositories to investigate whether they suffer the same fate. Results show that referencing external resources is prevalent and that the most frequent domains other than github.com are the external domains of Stack Overflow and Google Code. Similarly, links serve as source code context to commit messages, with inaccessible links being frequent. Although repeatedly referencing links is rare (4%), 14% of links that are prone to evolve, e.g., tutorials, articles, and software homepages, become unavailable over time. Furthermore, we find that 70% of the distinct links suffer from decay; the domains that occur the most frequently are related to Subversion repositories. We summarize that links in commits share the same fate as links in code, opening up avenues for future work.

Journal Article

SIEVE: Helping developers sift wheat from chaff via cross-platform analysis

by Lo, David , Sulistya Agus , Treude Christoph in Digital media , Embedding , Performance enhancement

2020

Software developers have benefited from various sources of knowledge such as forums, question-and-answer sites, and social media platforms to help them in various tasks. Extracting software-related knowledge from different platforms involves many challenges. In this paper, we propose an approach to improve the effectiveness of knowledge extraction tasks by performing cross-platform analysis. Our approach is based on transfer representation learning and word embedding, leveraging information extracted from a source platform which contains rich domain-related content. The information extracted is then used to solve tasks in another platform (considered as target platform) with less domain-related content. We first build a word embedding model as a representation learned from the source platform, and use the model to improve the performance of knowledge extraction tasks in the target platform. We experiment with Software Engineering Stack Exchange and Stack Overflow as source platforms, and two different target platforms, i.e., Twitter and YouTube. Our experiments show that our approach improves performance of existing work for the tasks of identifying software-related tweets and helpful YouTube comments.

Journal Article

Detecting outdated code element references in software repository documentation

by Treude, Christoph , Tan, Wen Siang , Wagner, Markus in Compilers , Computer Science , Documentation

2024

Outdated documentation is a pervasive problem in software development, preventing effective use of software, and misleading users and developers alike. We posit that one possible reason why documentation becomes out of sync so easily is that developers are unaware of when their source code modifications render the documentation obsolete. Ensuring that the documentation is always in sync with the source code takes considerable effort, especially for large codebases. To address this situation, we propose an approach that can automatically detect code element references that survive in the documentation after all source code instances have been deleted. In this work, we analysed over 3,000 GitHub projects and found that most projects contain at least one outdated code element reference at some point in their history. We submitted GitHub issues to real-world projects containing outdated references detected by our approach, some of which have already led to documentation fixes. As an initiative toward keeping documentation in software repositories up-to-date, we have made our implementation available for developers to scan their GitHub projects for outdated code element references.

Journal Article

Large language models for qualitative research in software engineering: exploring opportunities and challenges

by Bano, Muneera , Zowghi, Didar , Treude, Christoph in Artificial Intelligence , Automation , Chatbots

2024

The recent surge in the integration of Large Language Models (LLMs) like ChatGPT into qualitative research in software engineering, much like in other professional domains, demands a closer inspection. This vision paper seeks to explore the opportunities of using LLMs in qualitative research to address many of its legacy challenges as well as potential new concerns and pitfalls arising from the use of LLMs. We share our vision for the evolving role of the qualitative researcher in the age of LLMs and contemplate how they may utilize LLMs at various stages of their research experience.

Journal Article
