Catalogue Search | MBRL

Understanding large-scale software systems – structure and flows

by Levy, Omer , Feitelson, Dror G in Programming languages , Software development , Source code

2021

Program comprehension accounts for a large portion of software development costs and effort. The academic literature contains mainly research on program comprehension of short code snippets, but comprehension at the system level is no less important. We claim that comprehending a software system is a distinct activity that differs from code comprehension. We interviewed experienced developers, architects, and managers in the software industry and open-source community, to uncover the meaning of program comprehension at the system level; later we conducted a survey to verify the findings. The interviews demonstrate, among other things, that system comprehension is largely detached from code and programming language, and includes scope that is not captured in the code. It focuses on one hand on the structure of the system, and on the other hand on the flows in the system, but less on the code itself. System comprehension is a continuous, unending, iterative process, which utilizes white-box and black-box approaches at different layers of the system depending on needs, and combines both bottom-up and top-down comprehension strategies. In summary, comprehending a system is not just comprehending the code at a larger scale, and it is not possible to comprehend large systems at the same level as comprehending code.

Journal Article

Share this book

Add to My Shelf

Deep code comment generation with hybrid lexical and syntactical information

by Lo, David , Li, Ge , Hu, Xing in Artificial neural networks , Information retrieval , Machine translation

2020

During software maintenance, developers spend a lot of time understanding the source code. Existing studies show that code comments help developers comprehend programs and reduce additional time spent on reading and navigating source code. Unfortunately, these comments are often mismatched, missing or outdated in software projects. Developers have to infer the functionality from the source code. This paper proposes a new approach named Hybrid-DeepCom to automatically generate code comments for the functional units of Java language, namely, Java methods. The generated comments aim to help developers understand the functionality of Java methods. Hybrid-DeepCom applies Natural Language Processing (NLP) techniques to learn from a large code corpus and generates comments from learned features. It formulates the comment generation task as the machine translation problem. Hybrid-DeepCom exploits a deep neural network that combines the lexical and structure information of Java methods for better comments generation. We conduct experiments on a large-scale Java corpus built from 9,714 open source projects on GitHub. We evaluate the experimental results on both machine translation metrics and information retrieval metrics. Experimental results demonstrate that our method Hybrid-DeepCom outperforms the state-of-the-art by a substantial margin. In addition, we evaluate the influence of out-of-vocabulary tokens on comment generation. The results show that reducing the out-of-vocabulary tokens improves the accuracy effectively.

Journal Article

Share this book

Add to My Shelf

A practical guide on conducting eye tracking studies in software engineering

by Sharif Bonita , Begel, Andrew , Sharafi Zohreh in Design of experiments , Design standards , Engineering research

2020

For several years, the software engineering research community used eye trackers to study program comprehension, bug localization, pair programming, and other software engineering tasks. Eye trackers provide researchers with insights on software engineers’ cognitive processes, data that can augment those acquired through other means, such as on-line surveys and questionnaires. While there are many ways to take advantage of eye trackers, advancing their use requires defining standards for experimental design, execution, and reporting. We begin by presenting the foundations of eye tracking to provide context and perspective. Based on previous surveys of eye tracking for programming and software engineering tasks and our collective, extensive experience with eye trackers, we discuss when and why researchers should use eye trackers as well as how they should use them. We compile a list of typical use cases—real and anticipated—of eye trackers, as well as metrics, visualizations, and statistical analyses to analyze and report eye-tracking data. We also discuss the pragmatics of eye tracking studies. Finally, we offer lessons learned about using eye trackers to study software engineering tasks. This paper is intended to be a one-stop resource for researchers interested in designing, executing, and reporting eye tracking studies of software engineering tasks.

Journal Article

Share this book

Add to My Shelf

On the effects of program slicing for vulnerability detection during code inspection

by Tuma, Katja , Massacci, Fabio , Papotti, Aurora in Access control , Algorithms , Fault location

2025

Slicing is a fault localization technique that has been proposed to support debugging and program comprehension. Yet, its empirical effectiveness during code inspection by humans has received limited attention. The goal of our study is two-fold. First, we aim to define what it means for a code reviewer to identify the vulnerable lines correctly. Second, we investigate whether reducing the number of to-be-inspected lines by method-level slicing supports code reviewers in detecting security vulnerabilities. We propose a novel approach based on the notion of a$$\\delta $$δ -neighborhood (intuitively based on the idea of the context size of the command ) to define correctly identified lines. Then, we conducted a multi-year controlled experiment (2017-2023) in which MSc students attending security courses ($$n=236$$n = 236 ) were tasked with identifying vulnerable lines in original or sliced Java files from Apache Tomcat. We provide perfect seed lines for a slicing algorithm to control for confounding factors. Each treatment differs in the pair (Vulnerability, Original/Sliced) with a balanced design with vulnerabilities from the OWASP Top 10 2017: A1 (Injection), A5 (Broken Access Control), A6 (Security Misconfiguration), and A7 (Cross-Site Scripting). To generate smaller slices for human consumption, we used a variant of intra-procedural thin slicing. We report the results for$$\\delta = 0$$δ = 0 which corresponds to exactly matching the vulnerable ground truth lines, and$$\\delta = 3$$δ = 3 which represents the scenario of identifying the vulnerable area. For both cases, we found that slicing helps in ‘finding something’ (the participant has found at least some vulnerable lines) as opposed to ‘finding nothing’. For the case of$$\\delta = 0$$δ = 0 analyzing a slice and analyzing the original file are statistically equivalent from the perspective of lines found by those who found something. With$$\\delta = 3$$δ = 3 slicing helps to find more vulnerabilities compared to analyzing an original file, as we would normally expect. Given the type of population, additional experiments are necessary to be generalized to experienced developers.

Journal Article

Share this book

Add to My Shelf

Visualising data science workflows to support third-party notebook comprehension: an empirical study

by Sarasua, Cristina , Bernstein, Abraham , Ramasamy, Dhivyabharathi in Annotations , Data science , Empirical analysis

2023

Data science is an exploratory and iterative process that often leads to complex and unstructured code. This code is usually poorly documented and, consequently, hard to understand by a third party. In this paper, we first collect empirical evidence for the non-linearity of data science code from real-world Jupyter notebooks, confirming the need for new approaches that aid in data science code interaction and comprehension. Second, we propose a visualisation method that elucidates implicit workflow information in data science code and assists data scientists in navigating the so-called garden of forking paths in non-linear code. The visualisation also provides information such as the rationale and the identification of the data science pipeline step based on cell annotations. We conducted a user experiment with data scientists to evaluate the proposed method, assessing the influence of (i) different workflow visualisations and (ii) cell annotations on code comprehension. Our results show that visualising the exploration helps the users obtain an overview of the notebook, significantly improving code comprehension. Furthermore, our qualitative analysis provides more insights into the difficulties faced during data science code comprehension.

Journal Article

Share this book

Add to My Shelf

An Exploratory Study of How Developers Seek, Relate, and Collect Relevant Information during Software Maintenance Tasks

by Ko, Amy J. , Myers, Brad A. , Aung, Htet Htet in Bars , Computer programs , Debugging

2006

Much of software developers' time is spent understanding unfamiliar code. To better understand how developers gain this understanding and how software development environments might be involved, a study was performed in which developers were given an unfamiliar program and asked to work on two debugging tasks and three enhancement tasks for 70 minutes. The study found that developers interleaved three activities. They began by searching for relevant code both manually and using search tools; however, they based their searches on limited and misrepresentative cues in the code, environment, and executing program, often leading to failed searches. When developers found relevant code, they followed its incoming and outgoing dependencies, often returning to it and navigating its other dependencies; while doing so, however, Eclipse's navigational tools caused significant overhead. Developers collected code and other information that they believed would be necessary to edit, duplicate, or otherwise refer to later by encoding it in the interactive state of Eclipse's package explorer, file tabs, and scroll bars. However, developers lost track of relevant code as these interfaces were used for other tasks, and developers were forced to find it again. These issues caused developers to spend, on average, 35 percent of their time performing the mechanics of navigation within and between source files. These observations suggest a new model of program understanding grounded in theories of information foraging and suggest ideas for tools that help developers seek, relate, and collect information in a more effective and explicit manner.

Journal Article

Share this book

Add to My Shelf

A Systematic Survey of Program Comprehension through Dynamic Analysis

by Zaidman, A. , Koschke, R. , van Deursen, A. in Analysis , Availability , Communities

2009

Program comprehension is an important activity in software maintenance, as software must be sufficiently understood before it can be properly modified. The study of a program's execution, known as dynamic analysis, has become a common technique in this respect and has received substantial attention from the research community, particularly over the last decade. These efforts have resulted in a large research body of which currently there exists no comprehensive overview. This paper reports on a systematic literature survey aimed at the identification and structuring of research on program comprehension through dynamic analysis. From a research body consisting of 4,795 articles published in 14 relevant venues between July 1999 and June 2008 and the references therein, we have systematically selected 176 articles and characterized them in terms of four main facets: activity, target, method, and evaluation. The resulting overview offers insight in what constitutes the main contributions of the field, supports the task of identifying gaps and opportunities, and has motivated our discussion of several important research directions that merit additional consideration in the near future.

Journal Article

Share this book

Add to My Shelf

A Controlled Experiment for Program Comprehension through Trace Visualization

by Cornelissen, B , van Deursen, A , Zaidman, A in Added value , Analysis , Case studies

2011

Software maintenance activities require a sufficient level of understanding of the software at hand that unfortunately is not always readily available. Execution trace visualization is a common approach in gaining this understanding, and among our own efforts in this context is Extravis, a tool for the visualization of large traces. While many such tools have been evaluated through case studies, there have been no quantitative evaluations to the present day. This paper reports on the first controlled experiment to quantitatively measure the added value of trace visualization for program comprehension. We designed eight typical tasks aimed at gaining an understanding of a representative subject system, and measured how a control group (using the Eclipse IDE) and an experimental group (using both Eclipse and Extravis) performed these tasks in terms of time spent and solution correctness. The results are statistically significant in both regards, showing a 22 percent decrease in time requirements and a 43 percent increase in correctness for the group using trace visualization.

Journal Article

Share this book

Add to My Shelf

CoCoAST: Representing Source Code via Hierarchical Splitting and Reconstruction of Abstract Syntax Trees

by Zhang, Dongmei , Wang, Yanlin , Shi, Ensheng in Deep learning , Machine learning , Neural networks

2023

Recently, machine learning techniques especially deep learning techniques have made substantial progress on some code intelligence tasks such as code summarization, code search, clone detection, etc. How to represent source code to effectively capture the syntactic, structural, and semantic information is a key challenge. Recent studies show that the information extracted from abstract syntax trees (ASTs) is conducive to code representation learning. However, existing approaches fail to fully capture the rich information in ASTs due to the large size/depth of ASTs. In this paper, we propose a novel model CoCoAST that hierarchically splits and reconstructs ASTs to comprehensively capture the syntactic and semantic information of code without the loss of AST structural information. First, we hierarchically split a large AST into a set of subtrees and utilize a recursive neural network to encode the subtrees. Then, we aggregate the embeddings of subtrees by reconstructing the split ASTs to get the representation of the complete AST. Finally, we combine AST representation carrying the syntactic and structural information and source code embedding representing the lexical information to obtain the final neural code representation. We have applied our source code representation to two common program comprehension tasks, code summarization and code search. Extensive experiments have demonstrated the superiority of CoCoAST. To facilitate reproducibility, our data and code are available https://github.com/s1530129650/CoCoAST.

Journal Article

Share this book

Add to My Shelf

A unified multi-task learning model for AST-level and token-level code completion

by Wei, Bolin , Fu, Zhiyi , Li, Ge in Constraint modelling , Learning , Neural networks

2022

Code completion, one of the most useful features in the Integrated Development Environments (IDEs), can accelerate software development by suggesting the next probable tokens based on existing code in real-time. Recent studies have shown that recurrent neural networks based statistical language models can improve the performance of code completion tools through learning from large-scale software repositories. However, most of the existing approaches treat code completion as a single generation task in which the model predicts the value of the tokens or AST nodes based on the contextual source code without considering the syntactic constraints such as the static type information. Besides, the semantic relationships in programs can be very long. Existing recurrent neural networks based language models are not sufficient to model the long-term dependency. In this paper, we tackle the aforementioned limitations by building a unified multi-task learning based code completion model for both AST-level and token-level code completion. To model the relationship and constraints between the type and value of the code elements, we adopt a multi-task learning framework to predict the type and value of the tokens (AST nodes) simultaneously. To capture the long-term dependency in the input programs, we employ a self-attentional architecture based network as the base language model. We apply our approach to both AST-level and token-level code completion. Experimental results demonstrate the effectiveness of our model when compared with state-of-the-art methods.

Journal Article

Share this book

Add to My Shelf

Language Selector

MBRLGlobalSearch

Language Selector

Catalogue Search | MBRL

Search Results Heading

Explore the vast range of titles available.

MBRLSearchResults

MBRLHappinessMeter