Catalogue Search | MBRL
Explore the vast range of titles available.
22,453 result(s) for "Source code"
Commit2Vec: Learning Distributed Representations of Code Changes
by Cabrera Lozoya, Rocío; Sabetta, Antonino; Bezzi, Michele
in Classification, Computer Imaging, Computer Science
2021
Deep learning methods have found successful applications in fields like image classification and natural language processing. They have recently been applied to source code analysis too, due to the enormous amount of freely available source code (e.g., from open-source software repositories). In this work, we elaborate upon a state-of-the-art approach for source code representation, which uses information about its syntactic structure, and we extend it to represent source code changes (i.e., commits). We use this representation to tackle an industrially relevant task: the classification of security-relevant commits. We leverage transfer learning, a machine learning technique which reuses, or transfers, information learned from previous tasks (commonly called pretext tasks) to tackle a new target task. We assess the impact of using two different pretext tasks, for which abundant labeled data is available, on the classification of security-relevant commits. Our results indicate that representations that exploit the structural information in code syntax outperform token-based representations. Furthermore, we show that pre-training on a small dataset (> 10^4 samples), but for a pretext task that is closely related to the target task, yields better performance metrics than pre-training on a loosely related pretext task with a very large dataset (> 10^6 samples).
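The abstract's key claim is that structural (syntax-based) representations of commits beat token-based ones. As an illustration only, not the authors' Commit2Vec pipeline, a commit's structural change can be sketched as the difference between AST node-type counts before and after the change, using Python's stdlib `ast` module:

```python
import ast
from collections import Counter

def node_type_counts(source: str) -> Counter:
    """Count AST node types in a snippet (a crude structural fingerprint)."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))

def commit_delta(before: str, after: str) -> Counter:
    """Structural change vector: node-type counts after minus before."""
    delta = Counter(node_type_counts(after))
    delta.subtract(node_type_counts(before))
    return Counter({k: v for k, v in delta.items() if v != 0})

before = "def check(pw):\n    return True\n"
after = "def check(pw):\n    return hash(pw) == STORED_HASH\n"
print(commit_delta(before, after))
```

A security-relevant change like the one above shows up as added `Compare` and `Call` nodes and a removed `Constant`, structure that a bag of renamed tokens would not capture.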
Journal Article
A Bidirectional LSTM Language Model for Code Evaluation and Repair
by Rahman, Md. Mostafizer; Watanobe, Yutaka; Nakamura, Keita
in Application programming interface, Automation, Computer aided software engineering
2021
Programming is a vital skill in computer science and engineering-related disciplines. However, developing source code is an error-prone task. Logical errors in code are particularly hard to identify for both students and professionals, and even a single error can be unexpected to end-users. At present, conventional compilers have difficulty identifying many of the errors (especially logical errors) that can occur in code. To mitigate this problem, we propose a language model for evaluating source code using a bidirectional long short-term memory (BiLSTM) neural network. We trained the BiLSTM model on a large number of source codes while tuning various hyperparameters. We then used the model to evaluate incorrect code and assessed its performance in three principal areas: source code error detection, suggestions for incorrect code repair, and erroneous code classification. Experimental results showed that the proposed BiLSTM model achieved 50.88% correctness in identifying errors and providing suggestions. Moreover, the model achieved an F-score of approximately 97%, outperforming other state-of-the-art models (recurrent neural networks (RNNs) and long short-term memory (LSTM)).
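A real BiLSTM needs a deep learning framework, so as a deliberately simplified stand-in for the core idea (flag tokens that a trained language model considers unlikely in context), here is a token-bigram sketch in plain Python; the bigram model is my substitution, not the paper's architecture:

```python
from collections import defaultdict

class BigramCodeModel:
    """Token-bigram stand-in for a learned code language model:
    flags tokens never observed after their predecessor in training."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, token_seqs):
        for seq in token_seqs:
            for prev, cur in zip(["<s>"] + seq, seq + ["</s>"]):
                self.counts[prev][cur] += 1

    def flag_errors(self, seq):
        """Return the tokens that look out of place given their left context."""
        return [cur for prev, cur in zip(["<s>"] + seq, seq)
                if self.counts[prev][cur] == 0]

corpus = [["if", "(", "x", ")", "{", "}"],
          ["while", "(", "x", ")", "{", "}"]]
model = BigramCodeModel()
model.train(corpus)
print(model.flag_errors(["if", "(", "x", "{", "}"]))  # ['{'] : missing ')'
```

The BiLSTM in the paper generalizes this idea by conditioning on both left and right context with learned embeddings, which is what lets it also suggest repairs rather than merely flag anomalies.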
Journal Article
Evaluating the robustness of source code plagiarism detection tools to pervasive plagiarism-hiding modifications
2021
Source code plagiarism is a common occurrence in undergraduate computer science education. In order to identify such cases, many source code plagiarism detection tools have been proposed. A source code plagiarism detection tool evaluates pairs of assignment submissions to detect indications of plagiarism. However, a plagiarising student will commonly apply plagiarism-hiding modifications to source code in an attempt to evade detection. Moreover, prior work has implied that currently available source code plagiarism detection tools are not robust to the application of pervasive plagiarism-hiding modifications. In this article, 11 source code plagiarism detection tools are evaluated for robustness against plagiarism-hiding modifications. The tools are evaluated with data sets of simulated undergraduate plagiarism, constructed with source code modifications representative of undergraduate students. The results of the performed evaluations indicate that currently available source code plagiarism detection tools are not robust against modifications which apply fine-grained transformations to the source code structure. Of the evaluated tools, JPlag and Plaggie demonstrate the greatest robustness to different types of plagiarism-hiding modifications. However, the results also indicate that graph-based tools, specifically those that compare programs as program dependence graphs, show potentially greater robustness to pervasive plagiarism-hiding modifications.
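As a toy illustration of why token normalization matters for robustness (a fingerprinting sketch of mine, not one of the 11 evaluated tools): identifier renaming defeats naive k-gram comparison, but not comparison over normalized tokens:

```python
def kgrams(tokens, k=3):
    """All contiguous k-token windows, as a set of fingerprints."""
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def similarity(tokens_a, tokens_b, k=3):
    """Jaccard similarity over token k-grams."""
    a, b = kgrams(tokens_a, k), kgrams(tokens_b, k)
    return len(a & b) / len(a | b) if a | b else 0.0

def normalize(tokens, keywords=frozenset({"def", "return", "if", "for"})):
    """Replace every non-keyword identifier with a placeholder,
    so renaming variables no longer changes the fingerprints."""
    return [t if t in keywords or not t.isidentifier() else "ID"
            for t in tokens]

original = "def add ( a , b ) : return a + b".split()
renamed  = "def total ( x , y ) : return x + y".split()
print(similarity(original, renamed))                        # low raw overlap
print(similarity(normalize(original), normalize(renamed)))  # 1.0 after normalization
```

Structure-altering edits (statement reordering, loop rewriting) survive this normalization, which is exactly the "fine-grained transformations to the source code structure" the article finds current tools are not robust against.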
Journal Article
A new approach to software vulnerability detection based on CPG analysis
2023
Detecting source code vulnerabilities is an essential issue today. In this paper, to improve the efficiency of detecting vulnerabilities in software written in C/C++, we propose to use a combination of a Deep Graph Convolutional Neural Network (DGCNN) and the code property graph (CPG). The proposed method comprises three main phases. Phase 1 builds feature profiles of the source code; at this step, we use analysis techniques such as Word2vec and one-hot encoding to standardize and analyze the source code. Phase 2 extracts features of the source code based on the feature profiles; at this phase, we use the DGCNN model to analyze and extract features of the source code. Phase 3 classifies source code based on the features extracted in phase 2 to separate normal source code from source code containing security vulnerabilities. Scenarios comparing the proposed method with other approaches show the superior effectiveness of our approach. Besides confirming that the method is correct and reasonable, this result opens up a new approach to the task of detecting source code vulnerabilities.
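A code property graph layers AST, control-flow, and data-dependence edges over one node set. As a minimal sketch of just the syntax layer (using Python's stdlib `ast` on a Python snippet, whereas the paper targets C/C++ via dedicated CPG tooling), the parent-to-child edges a graph neural network would consume look like this:

```python
import ast

def ast_edges(source: str):
    """Parent->child edges of the AST: the syntax layer of a code
    property graph. A full CPG would add control-flow and
    data-dependence edges on the same nodes."""
    tree = ast.parse(source)
    edges = []
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            edges.append((type(parent).__name__, type(child).__name__))
    return edges

for edge in ast_edges("def f(x):\n    return x + 1\n"):
    print(edge)
```

A graph convolution then propagates feature vectors along exactly these edges, which is how structural context around a potentially vulnerable statement reaches the classifier.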
Journal Article
A novel approach for software vulnerability detection based on intelligent cognitive computing
2023
Improving and enhancing the effectiveness of software vulnerability detection methods is urgently needed today. In this study, we propose a new source code vulnerability detection method based on intelligent and advanced computational algorithms. It is a combination of four main processing techniques: (i) Source Embedding, (ii) Feature Learning, (iii) Resampling Data, and (iv) Classification. The Source Embedding method performs the task of analyzing and standardizing the source code based on the Joern tool and a data mining algorithm. The Feature Learning model aggregates and extracts source code attributes at the node level using machine learning and deep learning methods. The Resampling Data technique equalizes the experimental dataset. Finally, the Classification model detects source code vulnerabilities. The novelty of the new intelligent cognitive computing method is the combined and synchronized use of many different data extraction techniques to compute, represent, and extract the properties of the source code. With this new calculation method, many significant unusual properties and features of vulnerabilities are synthesized and extracted. To demonstrate the superiority of the proposed method, we run experiments detecting source code vulnerabilities on the Verum dataset; details are presented in the experimental section. The experimental results show that the proposed method performs well on all measures and, according to our survey, yields the best results reported to date for source code vulnerability detection on the Verum dataset.
With such results, the proposal in this study is meaningful not only scientifically but also practically, as using intelligent cognitive computing techniques to analyze and evaluate source code improves the efficiency of the source code analysis and vulnerability detection process.
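The "Resampling Data" stage balances a dataset in which vulnerable samples are rare. One simple form such a stage can take (shown as an assumption about the general technique; the abstract does not specify the exact algorithm) is random oversampling of the minority class:

```python
import random
from collections import Counter

def oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random until every class
    is as frequent as the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for label, group in by_class.items():
        grown = group + [rng.choice(group) for _ in range(target - len(group))]
        out_samples.extend(grown)
        out_labels.extend([label] * target)
    return out_samples, out_labels

samples = ["safe1", "safe2", "safe3", "safe4", "vuln1"]
labels  = [0, 0, 0, 0, 1]
balanced_samples, balanced_labels = oversample(samples, labels)
print(Counter(balanced_labels))
```

Without such balancing, a classifier trained on mostly safe code can reach high accuracy by predicting "safe" everywhere, which is why resampling precedes the classification stage.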
Journal Article
Automatic identification of self-admitted technical debt from four different sources
by Soliman, Mohamed; Li, Yikun; Avgeriou, Paris
in Artificial intelligence, Datasets, Debt management
2023
Technical debt refers to taking shortcuts to achieve short-term goals while sacrificing the long-term maintainability and evolvability of software systems. A large part of technical debt is explicitly reported by the developers themselves; this is commonly referred to as Self-Admitted Technical Debt or SATD. Previous work has focused on identifying SATD from source code comments and issue trackers. However, there are no approaches available for automatically identifying SATD from other sources such as commit messages and pull requests, or by combining multiple sources. Therefore, we propose and evaluate an approach for automated SATD identification that integrates four sources: source code comments, commit messages, pull requests, and issue tracking systems. Our findings show that our approach outperforms baseline approaches and achieves an average F1-score of 0.611 when detecting four types of SATD (i.e., code/design debt, requirement debt, documentation debt, and test debt) from the four aforementioned sources. Thereafter, we analyze 23.6M code comments, 1.3M commit messages, 3.7M issue sections, and 1.7M pull request sections to characterize SATD in 103 open-source projects. Furthermore, we investigate the SATD keywords and relations between SATD in different sources. The findings indicate, among others, that: 1) SATD is evenly spread among all sources; 2) issues and pull requests are the two most similar sources regarding the number of shared SATD keywords, followed by commit messages, and then followed by code comments; 3) there are four kinds of relations between SATD items in the different sources.
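Because SATD is by definition stated in natural language, a naive keyword baseline (my illustration, far simpler than the paper's classifier, which reports an average F1 of 0.611) already shows what the approach classifies across the four sources:

```python
import re

# Common self-admitted technical debt markers; an illustrative list,
# not the keyword set mined in the paper.
SATD_PATTERNS = [r"\btodo\b", r"\bfixme\b", r"\bhack\b",
                 r"\bworkaround\b", r"\btemporary\b"]

def is_satd(text: str) -> bool:
    """Flag text that admits technical debt via a marker keyword."""
    lowered = text.lower()
    return any(re.search(pattern, lowered) for pattern in SATD_PATTERNS)

texts_by_source = {
    "code comment":   "# TODO: handle the overflow case properly",
    "commit message": "Add caching layer for session lookups",
    "pull request":   "Quick hack until the upstream fix lands",
}
for source, text in texts_by_source.items():
    print(source, is_satd(text))
```

The paper's contribution is precisely that a learned model integrating all four sources outperforms keyword-style baselines, and that the SATD keywords differ per source.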
Journal Article
A Survey of Automatic Source Code Summarization
2022
Source code summarization refers to the natural language description of the source code's function. It can help developers easily understand the semantics of the source code. We can think of the source code and the corresponding summarization as being symmetric. However, existing source code summarization is often mismatched with the source code, missing, or out of date. Manual source code summarization is inefficient and requires a lot of human effort. To overcome these problems, many studies have been conducted on Automatic Source Code Summarization (ASCS). Given a set of source code, ASCS techniques can automatically generate a summary described in natural language. In this paper, we review the development of ASCS technology. Almost all ASCS technology involves the following stages: source code modeling, code summarization generation, and quality evaluation. We categorize the existing ASCS techniques based on these stages, analyze their advantages and shortcomings, and map out the development of the existing algorithms.
Journal Article
Distilled GPT for source code summarization
2024
A code summary is a brief natural language description of source code. Summaries are usually only a single sentence long, yet they form the backbone of developer documentation. A short description such as "changes all visible polygons to the color blue" can give a programmer a high-level idea of what code does without the effort of reading the code itself. Recently, products based on Large Language Models such as ChatGPT have demonstrated a strong ability to write these descriptions automatically. However, to use these tools, programmers must send their code to untrusted third parties for processing (e.g., via an API call). This loss of custody is not acceptable to many organizations. In this paper, we present an alternative: we train an open source model using sample output generated by GPT-3.5 in a process related to knowledge distillation. Our model is small enough (350M parameters) to be run on a single 16 GB GPU, yet we show in our evaluation that it is large enough to mimic GPT-3.5 on this task.
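Knowledge distillation of the kind the abstract mentions trains a small model to mimic a large one. A minimal sketch of one common ingredient, the temperature-softened cross-entropy between teacher and student output distributions (an assumption about the general technique, not this paper's exact training objective), in plain Python:

```python
import math

def softmax(logits, temperature=1.0):
    """Logits -> probabilities; a higher temperature softens the distribution."""
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy of the student against the teacher's softened
    distribution; minimized when the student matches the teacher."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher, student))

aligned  = distillation_loss([2.0, 1.0, 0.0], [2.0, 1.0, 0.0])
opposed  = distillation_loss([2.0, 1.0, 0.0], [0.0, 1.0, 2.0])
print(aligned, opposed)
```

The paper's setup is looser than this: the student trains on GPT-3.5's sampled text outputs rather than its logits, which the authors describe as a process "related to" knowledge distillation rather than the classic logit-matching form.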
Journal Article
Modeling source code in bimodal for program comprehension
by Jiang, He; Wen, Dongzhen; Diao, Yufeng
in Artificial Intelligence, Building codes, Computational Biology/Bioinformatics
2024
Source code is an intermediary through which humans communicate with computer systems. It contains a large amount of domain knowledge which can be learned by statistical models, and this knowledge can be used to build software engineering tools. We find that the functionality of source code depends on the programming-language-specific tokens which build its base structure, while identifiers provide natural language information. On this basis, we find that the knowledge in source code can be learned more fully when the source code is modeled bimodally. This paper presents the bimodal composition language model (BCLM) for source code modeling and representation. We analyze the effectiveness of bimodal modeling, and the results show that the bimodal approach has great potential for source code modeling and program comprehension.
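The two modalities the abstract describes, language-defined structural tokens versus natural-language-bearing identifiers, can be separated mechanically. Here is a small sketch of that split for Python source using the stdlib tokenizer (my illustration of the input separation, not the BCLM model itself):

```python
import io
import keyword
import tokenize

def bimodal_split(source: str):
    """Separate a snippet into its structural channel (keywords and
    operators) and its natural-language channel (identifiers)."""
    structural, natural = [], []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME:
            if keyword.iskeyword(tok.string):
                structural.append(tok.string)
            else:
                natural.append(tok.string)
        elif tok.type == tokenize.OP:
            structural.append(tok.string)
    return structural, natural

structural, natural = bimodal_split(
    "def read_user_profile(user_id):\n    return db.get(user_id)\n")
print(structural)  # keywords and operators: the skeleton of the program
print(natural)     # identifiers carrying domain vocabulary
```

The structural channel fixes what the code does (define, call, return), while names like `read_user_profile` carry the domain knowledge; modeling the two channels jointly is the bimodal idea the paper builds on.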
Journal Article