Out-domain Chinese new word detection with statistics-based character embedding

by Zhu, Jia; Yang, Min; Yiu, S M; Liang, Yuzhi

in Artificial intelligence / Asian languages / Chinese languages / Corpus linguistics / Data quality / English language / Experiments / Japanese language / Languages / Machine learning / Methods / Neural networks / Personality / Quality / Segmentation / Short term memory / Social networks / Software / Speech / Statistics / Word boundaries

2019

Journal Article

Overview

Unlike English and other Western languages, many Asian languages such as Chinese and Japanese do not delimit words with spaces. Word segmentation and new word detection are therefore key steps in processing these languages. Chinese word segmentation can be treated as a tagging problem similar to part-of-speech (POS) tagging: a corpus is segmented by assigning each character a label that indicates its position in a word (e.g., “B” for the beginning of a word and “E” for the end). Chinese word segmentation appears to be well studied, and machine learning models such as conditional random fields (CRF) and bi-directional long short-term memory (LSTM) networks have shown outstanding performance on this task. However, segmentation accuracy drops significantly when the same approaches are applied to out-domain cases, in which high-quality in-domain training data are not available. One example of an out-domain application is new word detection in Chinese microblogs, for which high-quality corpora are scarce. In this paper, we focus on out-domain Chinese new word detection. We first design a new method, Edge Likelihood (EL), for Chinese word boundary detection. We then propose a domain-independent Chinese new word detector (DICND), in which each Chinese character is represented as a low-dimensional vector whose values are segmentation-related features of the character.
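
The character-labelling view of segmentation described in the overview can be illustrated with a minimal Python sketch. It is not the authors' EL or DICND method. The overview names only the "B" and "E" labels; the "M" (middle) and "S" (single-character word) labels, the function names, and the sample sentence are assumptions added here for illustration.

```python
from typing import List


def words_to_tags(words: List[str]) -> List[str]:
    """Turn a segmented sentence into one position label per character.

    Labels (B and E come from the overview; M and S are assumed here):
    B = first character of a word, M = middle, E = last, S = one-character word.
    """
    tags: List[str] = []
    for word in words:
        if not word:
            continue  # ignore empty strings so they produce no labels
        if len(word) == 1:
            tags.append("S")
        else:
            tags.append("B")
            tags.extend("M" * (len(word) - 2))  # zero or more middle labels
            tags.append("E")
    return tags


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Rebuild the segmentation by cutting after every E or S label."""
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:
        words.append(current)  # keep a trailing unfinished word, if any
    return words


# Hypothetical sample sentence, already segmented into three words.
sample_words = ["我", "喜欢", "自然语言"]
sample_tags = words_to_tags(sample_words)
restored = tags_to_words("".join(sample_words), sample_tags)
```

In this framing a tagger such as a CRF or a bi-directional LSTM would predict the label sequence from the raw characters, and `tags_to_words` would convert the prediction back into words.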

Publisher

Cambridge University Press

Subject

Artificial intelligence