Catalogue Search | MBRL
809 result(s) for "code similarity"
Learning Human-Written Commit Messages to Document Code Changes
by Zhou, Hao-Jie; Jia, Nan; Zheng, Zi-Bin
in Artificial Intelligence, Computer Science, Control systems
2020
Commit messages are important complementary information used in understanding code changes. To address message scarcity, some work has been proposed for automatically generating commit messages. However, most of these approaches focus on generating summaries of the changed software entities at the superficial level, without considering the intent behind the code changes (e.g., the existing approaches cannot generate a message such as “fixing null pointer exception”). Considering that developers often describe the intent behind the code change when writing the messages, we propose ChangeDoc, an approach to reuse existing messages in version control systems for automatic commit message generation. Our approach includes syntax, semantic, pre-syntax, and pre-semantic similarities. For a given commit without messages, it is able to discover its most similar past commit from a large commit repository, and recommend its message as the message of the given commit. Our repository contains half a million commits that were collected from SourceForge. We evaluate our approach on the commits from 10 projects. The results show that 21.5% of the recommended messages by ChangeDoc can be directly used without modification, and 62.8% require minor modifications. In order to evaluate the quality of the commit messages recommended by ChangeDoc, we performed two empirical studies involving a total of 40 participants (10 professional developers and 30 students). The results indicate that the recommended messages are very good approximations of the ones written by developers and often include important intent information that is not included in the messages generated by other tools.
Journal Article
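The retrieval idea in the ChangeDoc abstract above (find the most similar past commit and reuse its message) can be sketched roughly as follows. This is only an illustration: the abstract names syntax, semantic, pre-syntax, and pre-semantic similarities but does not define them here, so the sketch substitutes a plain Jaccard similarity over identifier tokens. The function names and the sample history are hypothetical.

```python
import re


def tokens(diff_text):
    """Return the set of lowercase identifier-like tokens in a diff."""
    return set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", diff_text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def recommend_message(new_diff, history):
    """Return (message, score) of the most similar past commit.

    `history` is a list of (diff_text, message) pairs. Jaccard similarity
    is a stand-in here, not the similarity measure used by ChangeDoc.
    """
    query = tokens(new_diff)
    best_message, best_score = None, -1.0
    for past_diff, message in history:
        score = jaccard(query, tokens(past_diff))
        if score > best_score:
            best_message, best_score = message, score
    return best_message, best_score


# Hypothetical two-commit history, for illustration only.
history = [
    ("if (user != null) { user.save(); }", "fixing null pointer exception"),
    ("timeout = 30; retries = 5;", "tune network timeouts"),
]
message, score = recommend_message("if (order != null) { order.save(); }", history)
print(message, round(score, 2))
```

A linear scan is enough for a sketch; a repository of half a million commits, as described in the abstract, would presumably need an index rather than a full pass per query.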
Code Similarity Prediction Model for Industrial Management Features Based on Graph Neural Networks
2024
The code of industrial management software typically features few system API calls and a high number of customized variables and structures. This makes the similarity of such codes difficult to compute using text features or traditional neural network methods. In this paper, we propose an FSPS-GNN model, which is based on graph neural networks (GNNs), to address this problem. The model categorizes code features into two types, outer graph and inner graph, and conducts training and prediction with four stages—feature embedding, feature enhancement, feature fusion, and similarity prediction. Moreover, differently structured GNNs were used in the embedding and enhancement stages, respectively, to increase the interaction of code features. Experiments with code from three open-source projects demonstrate that the model achieves an average precision of 87.57% and an F0.5 Score of 89.12%. Compared to existing similarity-computation models based on GNNs, this model exhibits a Mean Squared Error (MSE) that is approximately 0.0041 to 0.0266 lower and an F0.5 Score that is 3.3259% to 6.4392% higher. It broadens the application scope of GNNs and offers additional insights for the study of code-similarity issues.
Journal Article
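The FSPS-GNN abstract above reports an F0.5 score. For readers unfamiliar with the metric, the standard F-beta definition weights precision more heavily than recall when beta is below 1. The sketch below implements that general formula; the sample precision and recall values are made up and are not taken from the paper.

```python
def f_beta(precision, recall, beta=0.5):
    """Standard F-beta score: (1 + b^2) * P * R / (b^2 * P + R)."""
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


# Illustrative values only: with beta = 0.5 the score sits closer to
# precision than the balanced F1 score (beta = 1.0) does.
print(f_beta(0.8, 0.5))        # F0.5
print(f_beta(0.8, 0.5, 1.0))   # F1, for comparison
```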
Deep learning based-approach for quick response code verification
by Xuan Viet, Truong; Hoang Viet, Tran; Vinh Loc, Cu
in Artificial Intelligence, Artificial neural networks, Coding
2023
Quick response (QR) code-based traceability is considered a smart solution to know details about the origin of products, from production to transportation and preservation before reaching customers. However, the QR code is easily copied and forged. Thus, we propose a new approach to protect this code from tampering. The approach consists of two main phases: hiding a security feature in the QR code, and estimating the similarity between the QR code affixed on the product and the genuine ones. For the former issue, the secret feature is encoded and decoded by using error correcting code for controlling errors in noisy communication channels. Hiding and extracting the encoded information in the QR code are conducted by utilizing a deep neural network in which the proposed network produces a watermarked QR code image with good quality and high tolerance to noise. The network is robust against real distortions caused by the process of printing and photographing. For the latter issue, we develop neural networks based upon the architecture of Siamese network to measure the similarity of QR codes. The secret feature extracted from the obtained QR code and the result of QR code similarity estimation are combined to determine whether a QR code is genuine or fake. The proposed approach gives a competitive performance, with an average accuracy of 98%, and it has been applied to QR code authentication in practice.
Journal Article
IoTSim: Internet of Things-Oriented Binary Code Similarity Detection with Multiple Block Relations
by Xie, Wei; Luo, Zhenhao; Zhou, Xu
in binary code similarity detection, Codes, Computational linguistics
2023
Binary code similarity detection (BCSD) plays a crucial role in various computer security applications, including vulnerability detection, malware detection, and software component analysis. With the development of the Internet of Things (IoT), there are many binaries from different instruction architecture sets, which require BCSD approaches robust against different architectures. In this study, we propose a novel IoT-oriented binary code similarity detection approach. Our approach leverages a customized transformer-based language model with disentangled attention to capture relative position information. To mitigate out-of-vocabulary (OOV) challenges in the language model, we introduce a base-token prediction pre-training task aimed at capturing basic semantics for unseen tokens. During function embedding generation, we integrate directed jumps, data dependency, and address adjacency to capture multiple block relations. We then assign different weights to different relations and use multi-layer Graph Convolutional Networks (GCN) to generate function embeddings. We implemented the prototype of IoTSim. Our experimental results show that our proposed block relation matrix improves IoTSim with large margins. With a pool size of 103, IoTSim achieves a recall@1 of 0.903 across architectures, outperforming the state-of-the-art approaches Trex, SAFE, and PalmTree.
Journal Article
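The IoTSim abstract above reports recall@1 for a given pool size. As a rough sketch of how such a retrieval metric is commonly computed: each query function embedding is ranked against a pool of candidate embeddings, and a hit is counted when the true match appears in the top k. The cosine ranking, the function names, and the toy two-dimensional vectors below are assumptions for illustration and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def recall_at_k(queries, pool, truth, k=1):
    """Fraction of queries whose true match ranks in the top k of the pool.

    `truth[i]` is the index in `pool` of the correct match for `queries[i]`.
    """
    hits = 0
    for query, true_index in zip(queries, truth):
        ranked = sorted(
            range(len(pool)),
            key=lambda i: cosine(query, pool[i]),
            reverse=True,
        )
        if true_index in ranked[:k]:
            hits += 1
    return hits / len(queries)


# Toy embeddings: the third query is closer to pool[0] than to its true match.
pool = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.2]]
truth = [0, 1, 2]
print(recall_at_k(queries, pool, truth, k=1))
print(recall_at_k(queries, pool, truth, k=2))
```

A larger pool makes the task harder because each query competes against more distractors, which is why pool size is reported alongside the score.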
Exploring the Boundaries Between LLM Code Clone Detection and Code Similarity Assessment on Human and AI-Generated Code
2025
As Large Language Models (LLMs) continue to advance, their capabilities in code clone detection have garnered significant attention. While much research has assessed LLM performance on human-generated code, the proliferation of LLM-generated code raises critical questions about their ability to detect clones across both human- and LLM-created codebases, as this capability remains largely unexplored. This paper addresses this gap by evaluating two versions of LLaMA3 on these distinct types of datasets. Additionally, we perform a deeper analysis beyond simple prompting, examining the nuanced relationship between code cloning and code similarity that LLMs infer. We further explore how fine-tuning impacts LLM performance in clone detection, offering new insights into the interplay between code clones and similarity in human versus AI-generated code. Our findings reveal that LLaMA models excel in detecting syntactic clones but face challenges with semantic clones. Notably, the models perform better on LLM-generated datasets for semantic clones, suggesting a potential bias. The fine-tuning technique enhances the ability of LLMs to comprehend code semantics, improving their performance in both code clone detection and code similarity assessment. Our results offer valuable insights into the effectiveness and characteristics of LLMs in clone detection and code similarity assessment, providing a foundation for future applications and guiding further research in this area.
Journal Article
Binary Code Similarity Detection: Retrospective Review and Future Directions
by Chang, Shengjia; Cui, Baojiang; Feng, Shaocong
in Artificial intelligence, Binary codes, Codes
2025
Binary Code Similarity Detection (BCSD) is vital for vulnerability discovery, malware detection, and software security, especially when source code is unavailable. Yet, it faces challenges from semantic loss, recompilation variations, and obfuscation. Recent advances in artificial intelligence—particularly natural language processing (NLP), graph representation learning (GRL), and large language models (LLMs)—have markedly improved accuracy, enabling better recognition of code variants and deeper semantic understanding. This paper presents a comprehensive review of 82 studies published between 1975 and 2025, systematically tracing the historical evolution of BCSD and analyzing the progressive incorporation of artificial intelligence (AI) techniques. Particular emphasis is placed on the role of LLMs, which have recently emerged as transformative tools in advancing semantic representation and enhancing detection performance. The review is organized around five central research questions: (1) the chronological development and milestones of BCSD; (2) the construction of AI-driven technical roadmaps that chart methodological transitions; (3) the design and implementation of general analytical workflows for binary code analysis; (4) the applicability, strengths, and limitations of LLMs in capturing semantic and structural features of binary code; and (5) the persistent challenges and promising directions for future investigation. By synthesizing insights across these dimensions, the study demonstrates how LLMs reshape the landscape of binary code analysis, offering unprecedented opportunities to improve accuracy, scalability, and adaptability in real-world scenarios. This review not only bridges a critical gap in the existing literature but also provides a forward-looking perspective, serving as a valuable reference for researchers and practitioners aiming to advance AI-powered BCSD methodologies and applications.
Journal Article
A real noise resistance for anti-tampering quick response code
by Viet, Tran Hoang; Thao, Le Hoang; Viet, Nguyen Hoang
in Artificial Intelligence, Artificial neural networks, Computational Biology/Bioinformatics
2024
Traceability via quick response (QR) codes is regarded as a clever way to learn specifics about a product’s history, from its creation to its transit and preservation before reaching consumers. The QR code can, however, be easily copied and faked. Therefore, we suggest a novel strategy to prevent tampering with this code. The method is divided into two primary phases: concealing a security element in the QR code and determining how similar the QR code on the goods is to the real ones. For the first problem, error-correcting coding is used to encode and decode the secret feature in order to manage faults in noisy communication channels. A deep neural network is used to both conceal and extract the information encoded in a QR code, and the suggested network creates watermarked QR code images with good quality and noise tolerance. The network has the ability to be resilient to actual distortions brought on by the printing and photographing processes. In order to measure the similarity of QR codes, we create neural networks based on the Siamese network design. To assess whether a QR code is real or fraudulent, the hidden characteristic extracted from the acquired QR code and the outcome of QR code similarity estimation are merged. With an average accuracy of 98%, the proposed technique performs competitively and has been used in practice for QR code authentication.
Journal Article
Gradient-Guided Assembly Instruction Relocation for Adversarial Attacks Against Binary Code Similarity Detection
2026
Transformer-based models have significantly advanced binary code similarity detection (BCSD) by leveraging their semantic encoding capabilities for efficient function matching across diverse compilation settings. Although adversarial examples can strategically undermine the accuracy of BCSD models and protect critical code, existing techniques predominantly depend on inserting artificial instructions, which incur high computational costs and offer limited diversity of perturbations. To address these limitations, we propose AIMA, a novel gradient-guided assembly instruction relocation method. Our method decouples the detection model into tokenization, embedding, and encoding layers to enable efficient gradient computation. Since token IDs of instructions are discrete and non-differentiable, we compute gradients in the continuous embedding space to evaluate the influence of each token. The most critical tokens are identified by calculating the norm of their embedding gradients. We then establish a mapping between instructions and their corresponding tokens to aggregate token-level importance into instruction-level significance. To maximize adversarial impact, a sliding window algorithm selects the most influential contiguous segments for relocation, ensuring optimal perturbation with minimal length. This approach efficiently locates critical code regions without expensive search operations. The selected segments are relocated outside their original function boundaries via a jump mechanism, which preserves runtime control flow and functionality while introducing “deletion” effects in the static instruction sequence. Extensive experiments show that AIMA reduces similarity scores by up to 35.8% in state-of-the-art BCSD models. When incorporated into training data, it also enhances model robustness, achieving a 5.9% improvement in AUROC.
Journal Article
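The selection steps described in the AIMA abstract above (gradient norms per token, aggregation to instruction level, then a sliding window over contiguous instructions) can be sketched as below. This is not the authors' implementation: the gradients are hand-written toy values rather than model outputs, summing token scores per instruction is an assumed aggregation rule, and the window width is fixed by the caller. All function names are hypothetical.

```python
import math


def token_importance(embedding_grads):
    """L2 norm of each token's embedding gradient."""
    return [math.sqrt(sum(g * g for g in grad)) for grad in embedding_grads]


def instruction_importance(token_scores, token_to_instruction, n_instructions):
    """Sum token scores per instruction (assumed aggregation rule)."""
    scores = [0.0] * n_instructions
    for score, instruction_index in zip(token_scores, token_to_instruction):
        scores[instruction_index] += score
    return scores


def best_window(scores, width):
    """Return (start, total) of the highest-scoring contiguous window."""
    if not 0 < width <= len(scores):
        raise ValueError("width must be between 1 and len(scores)")
    current = sum(scores[:width])
    best_start, best_total = 0, current
    for start in range(1, len(scores) - width + 1):
        # Slide right by one: add the entering score, drop the leaving one.
        current += scores[start + width - 1] - scores[start - 1]
        if current > best_total:
            best_start, best_total = start, current
    return best_start, best_total


# Toy gradients for five tokens that belong to three instructions.
grads = [[3.0, 4.0], [0.0, 1.0], [6.0, 8.0], [0.0, 0.0], [1.0, 0.0]]
token_to_instruction = [0, 0, 1, 2, 2]
per_instruction = instruction_importance(
    token_importance(grads), token_to_instruction, 3
)
print(per_instruction)
print(best_window(per_instruction, 2))
```

The running-sum update keeps the window search linear in the number of instructions, which matches the abstract's point about avoiding expensive search operations.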
Software system comparison with semantic source code embeddings
by Sašo, Karakatič; Miloševič Aleksej; Tjaša, Heričko
in Libraries, Metric space, Neural networks
2022
This paper presents a novel approach for comparing software systems by calculating the robust Hausdorff distance between semantic source code embeddings of individual software components, i.e., methods. The proposed approach represents each software as a set of vectors, where every vector is a semantic source code embedding of a particular method. The code embeddings are constructed from abstract syntax trees of the methods with the help of attention-based neural network models that capture the semantics of the methods. Previous research has shown that comparing semantic source code embeddings can reveal semantic relationships between the two methods. We utilize this characteristic to estimate the semantic similarity between the two software systems by computing the robust Hausdorff distance. In the experiment, a pre-trained code2vec neural network model is used to create the source code vector representations of several open-source Java-based libraries. Several variations of the robust Hausdorff distance are evaluated. The results show that the proposed approach can effectively estimate the semantic similarity, reflecting the software library’s scopes, software evolution, and individual parts (e.g., packages) of those libraries.
Journal Article
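The abstract above compares two software systems, each represented as a set of method embeddings, using a robust Hausdorff distance. The sketch below shows the classic Hausdorff distance together with one common robust variant that averages the nearest-neighbour distances instead of taking their maximum. Which variants the paper evaluates is not stated here, so the averaging variant is an assumption, and the toy vectors are not real code2vec embeddings.

```python
import math
from statistics import mean


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def directed_distance(a_set, b_set, aggregate=max):
    """Aggregate, over each vector in a_set, its distance to the nearest vector in b_set."""
    nearest = [min(euclidean(a, b) for b in b_set) for a in a_set]
    return aggregate(nearest)


def hausdorff(a_set, b_set, aggregate=max):
    """Symmetric Hausdorff-style distance between two sets of vectors.

    aggregate=max gives the classic definition; aggregate=mean gives an
    averaged variant that is less sensitive to a single outlier.
    """
    return max(
        directed_distance(a_set, b_set, aggregate),
        directed_distance(b_set, a_set, aggregate),
    )


# Toy "systems": B contains one outlier method embedding at (5, 0).
system_a = [(0.0, 0.0), (1.0, 0.0)]
system_b = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
print(hausdorff(system_a, system_b))        # classic
print(hausdorff(system_a, system_b, mean))  # averaged variant
```

In the toy data a single outlier determines the classic distance entirely, while the averaged variant dilutes it, which illustrates why a robust form is preferable when comparing sets of learned embeddings.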
Codeformer: A GNN-Nested Transformer Model for Binary Code Similarity Detection
2023
Binary code similarity detection is used to calculate the code similarity of a pair of binary functions or files, through a certain calculation method and judgment method. It is a fundamental task in the field of computer binary security. Traditional methods of similarity detection usually use graph matching algorithms, but these methods have poor performance and unsatisfactory effects. Recently, graph neural networks have become an effective method for analyzing graph embeddings in natural language processing. Although these methods are effective, the existing methods still do not sufficiently learn the information of the binary code. To solve this problem, we propose Codeformer, an iterative model of a graph neural network (GNN)-nested Transformer. The model uses a Transformer to obtain an embedding vector of the basic block and uses the GNN to update the embedding vector of each basic block of the control flow graph (CFG). Codeformer iteratively executes basic block embedding to learn abundant global information and finally uses the GNN to aggregate all the basic blocks of a function. We conducted experiments on the OpenSSL, Clamav and Curl datasets. The evaluation results show that our method outperforms the state-of-the-art models.
Journal Article