Asset Details
Transformer-Enhanced Text Classification in Cybersecurity: GPT-Augmented Synthetic Data Generation, BERT-Based Semantic Encoding, and Multiclass Analysis
by
Houston, Robert A
in
Artificial intelligence
/ Computer Engineering
/ Engineering
2024
Dissertation
Overview
This research investigates the use of Large Language Models (LLMs) for generative and encoding tasks in the domain of cybersecurity. It demonstrates critical uses of transformer-derived deep learning architectures on two fronts: Generative Pretrained Transformer 2 (GPT-2) is used to generate synthetic data that balances otherwise imbalanced datasets for multi-class cybersecurity classification, and the family of Bidirectional Encoder Representations from Transformers (BERT) models is used to produce word embeddings.

Three factors link these research areas. First, the BERT embeddings are derived from a dataset synthetically balanced by GPT-2's generative capabilities. Second, both BERT and GPT-2 are pretrained models that are subsequently fine-tuned on each minority class of the dataset. Third, both architectures derive from the transformer model introduced in the seminal paper Attention Is All You Need (Vaswani et al., 2017). In this research, the synthetic datasets and the embedding models produced with transformer models are evaluated using traditional Machine Learning (ML) models and a novel weighted aggregation of the F1 score, developed to account for the disparate risk inherent in different classification classes in cybersecurity applications.

Another aspect of this research centers on publicly licensed models with open-source base-training weights. The objective is to analyze lightweight LLMs that are both deployable in computationally constrained environments and usable in secure environments, with no need to pass data to public cloud enclaves via commercial API calls to monetized foundational models.

This research addresses essential needs relevant to using AI models in cybersecurity applications. The primary goals address the need for balanced, high-quality training datasets and for semantically aware Natural Language Processing (NLP) vectorization methods for ML models. As a backdrop, there is an emphasis on lightweight, democratized models that are not monetized and are widely available with open-source model weights.
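The risk-weighted F1 aggregation described above can be sketched as follows. This is a minimal illustrative implementation, not the dissertation's actual metric: the class labels, risk weights, and normalization scheme shown here are hypothetical, since the abstract does not specify them. The idea is simply that per-class F1 scores are combined with weights reflecting the security impact of misclassifying each class, rather than with uniform or support-based weights.

```python
# Sketch of a risk-weighted aggregation of per-class F1 scores.
# Class names and risk weights are hypothetical placeholders.

def per_class_f1(y_true, y_pred, label):
    """Standard one-vs-rest F1 for a single class label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

def risk_weighted_f1(y_true, y_pred, risk_weights):
    """Aggregate per-class F1 with weights reflecting the relative risk
    of misclassifying each class; normalized by the total weight."""
    total = sum(risk_weights.values())
    return sum(
        w * per_class_f1(y_true, y_pred, label)
        for label, w in risk_weights.items()
    ) / total

# Hypothetical attack categories, with higher weight on higher-risk classes.
weights = {"benign": 1.0, "phishing": 2.0, "malware": 3.0}
y_true = ["benign", "malware", "phishing", "benign", "malware"]
y_pred = ["benign", "malware", "benign", "benign", "malware"]
print(round(risk_weighted_f1(y_true, y_pred, weights), 3))  # prints 0.633
```

Because the weight on a rare but dangerous class (here, "malware") is larger, a model that misses that class is penalized more heavily than the ordinary macro-averaged F1 would penalize it.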
Publisher
ProQuest Dissertations & Theses
ISBN
9798381111774