Document Vector Representation with Enhanced Features Based on Doc2VecC

in Algorithms / Classification / Deletion / Documents / Effectiveness / Efficiency / Methods / Natural language processing / Neural networks / Representations / Semantics / Statistical methods / Words (language)

2024

Journal Article

Overview

The main purpose of document vectorization is to turn the words of a document into vectors that can express the document's semantics. In both Chinese and English, words are the most basic units of text processing, and the effectiveness of natural language processing tasks is highly correlated with the document vector representation method. Document vectorization methods include statistics-based methods and neural network-based methods. In general, however, many of them are generic methods that do not distinguish between long and short texts or between English and Chinese usage scenarios, which leads to unsatisfactory document classification results. This paper develops a PV-IDF model with enhanced features to address the document feature loss caused by the random deletion method of the Doc2VecC model, and it proposes inverse document frequency (IDF) as an important indicator in the candidate word deletion strategy. This speeds up model training and improves the effectiveness of document classification. In the experiments, the PV-IDF model with enhanced features performs better on both long and short documents, and on both English and Chinese documents, with important advantages in algorithm execution efficiency and error rate, particularly for short documents. The proposed method outperforms the Doc2VecC model on each of the five evaluation indicators of classification quality; its average error rate for short document classification is 41% lower than that of the Doc2VecC model and 45.2% lower than that of the PV-DM model. Whereas the Doc2VecC model shows high efficiency only on small-scale datasets, the PV-IDF model demonstrates high training efficiency on datasets of various scales. As a result, the proposed method can provide high-quality vector representations for documents of varying length and enhance the effectiveness of related operations.
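
The overview contrasts random word deletion with deletion guided by inverse document frequency. The sketch below illustrates that general idea only and is not the paper's PV-IDF algorithm. The smoothed IDF formula, the linear mapping from IDF to a deletion probability, the choice to delete low-IDF (common) words more often, and all function and parameter names (`inverse_document_frequency`, `drop_probabilities`, `corrupt`, `base_rate`) are assumptions made for illustration.

```python
import math
import random
from collections import Counter


def inverse_document_frequency(corpus):
    """Return a smoothed IDF per word: log((1 + N) / (1 + df)) + 1."""
    n_docs = len(corpus)
    doc_freq = Counter(word for doc in corpus for word in set(doc))
    return {w: math.log((1 + n_docs) / (1 + df)) + 1.0
            for w, df in doc_freq.items()}


def drop_probabilities(idf, base_rate=0.5):
    """Map IDF to a per-word deletion probability (hypothetical rule).

    Assumption: common (low-IDF) words carry less document-specific
    information, so they are deleted more often. The lowest-IDF word gets
    `base_rate`, the highest-IDF word gets 0, with linear interpolation
    in between.
    """
    lo, hi = min(idf.values()), max(idf.values())
    if hi == lo:
        # No IDF contrast: fall back to uniform random deletion.
        return {w: base_rate for w in idf}
    return {w: base_rate * (hi - v) / (hi - lo) for w, v in idf.items()}


def corrupt(doc, drop_prob, rng):
    """Delete each word independently with its own probability."""
    return [w for w in doc if rng.random() >= drop_prob.get(w, 0.0)]


# Toy corpus of tokenized documents (illustrative data, not from the paper).
corpus = [
    ["the", "model", "learns", "vectors"],
    ["the", "model", "drops", "words"],
    ["the", "idf", "weights", "words"],
]
idf = inverse_document_frequency(corpus)
probs = drop_probabilities(idf, base_rate=0.5)
corrupted = corrupt(corpus[2], probs, random.Random(0))
```

With this rule, "the" appears in every toy document and is the most likely word to be deleted, while words that occur in only one document are always kept. Uniform random deletion corresponds to giving every word the same probability.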

Publisher

Springer Nature B.V

Subject

/ Methods