Project

• Chinese Word Vectors   

This project provides 100+ Chinese word vectors (embeddings) trained with different representations (dense and sparse), context features (word, ngram, character, and more), and corpora. Users can pick pre-trained vectors with the properties they need and use them for downstream tasks.

Moreover, we provide a Chinese analogical reasoning dataset CA8 and an evaluation toolkit for users to evaluate the quality of their word vectors.
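As a rough illustration of what an analogical reasoning evaluation measures, the sketch below answers questions of the form "a is to b as c is to ?" with the common 3CosAdd rule (pick the word whose vector is closest in cosine similarity to b - a + c). It is a minimal sketch on hand-made toy vectors, not the project's evaluation toolkit: the function names `solve_analogy` and `analogy_accuracy`, the five-word vocabulary, and the 3-dimensional vectors are all assumptions made for the example.

```python
import numpy as np


def normalize(matrix):
    """Scale every row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.clip(norms, 1e-12, None)


def solve_analogy(vocab, vectors, a, b, c):
    """Answer 'a is to b as c is to ?' with 3CosAdd, excluding a, b and c."""
    index = {word: i for i, word in enumerate(vocab)}
    unit = normalize(vectors)
    target = unit[index[b]] - unit[index[a]] + unit[index[c]]
    scores = unit @ target
    for word in (a, b, c):
        scores[index[word]] = -np.inf  # question words cannot be the answer
    return vocab[int(np.argmax(scores))]


def analogy_accuracy(vocab, vectors, questions):
    """Fraction of (a, b, c, d) questions whose predicted answer equals d."""
    correct = sum(
        solve_analogy(vocab, vectors, a, b, c) == d for a, b, c, d in questions
    )
    return correct / len(questions)


# Toy data (hypothetical): man, woman, king, queen, apple.
vocab = ["男人", "女人", "国王", "王后", "苹果"]
vectors = np.array([
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
    [0.0, 0.0, -1.0],
])
questions = [("男人", "女人", "国王", "王后")]

accuracy = analogy_accuracy(vocab, vectors, questions)
```

With real pre-trained vectors, `vocab` and `vectors` would be loaded from a downloaded vector file and `questions` from an analogy dataset such as CA8; the loading step is left out here because it depends on the file format.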

• Ancient Chinese Punctuation

This project is a tool for segmenting and punctuating ancient Chinese text. Ancient Chinese texts were written without segmentation or punctuation, so people who study and collate ancient books have to spend a great deal of time adding them by hand, and the flexibility of the language makes this a hard task. Trained on a large amount of data (3.3 billion characters), the tool achieves F1 > 0.9 in segmentation and F1 > 0.8 in punctuation on a large test set that covers various styles, e.g. prose, poetry, Song Ci, prescriptions, Buddhist sutras, etc. The tool can help researchers collate ancient books, and it can also be used to correct books that have already been collated.
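To make the F1 figures concrete, the sketch below scores a predicted segmentation against a reference one by comparing the character offsets at which breaks are placed. This is one plausible way to compute a segmentation F1, assumed here for illustration; the source does not say how the tool's scores are actually calculated, and the helper names `boundary_positions` and `f1_score` are hypothetical.

```python
def boundary_positions(segments):
    """Character offsets after which a break occurs (end of text excluded)."""
    positions, offset = set(), 0
    for segment in segments[:-1]:
        offset += len(segment)
        positions.add(offset)
    return positions


def f1_score(gold, predicted):
    """Harmonic mean of precision and recall over two sets of break offsets."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Reference segmentation versus a prediction that misplaces one break.
gold = boundary_positions(["学而时习之", "不亦说乎", "有朋自远方来", "不亦乐乎"])
pred = boundary_positions(["学而时习之", "不亦说乎有朋", "自远方来", "不亦乐乎"])

score = f1_score(gold, pred)
```

Here two of the three predicted breaks match the reference, so precision and recall are both 2/3. A punctuation F1 could be computed the same way by also comparing which mark is placed at each offset.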

• AI Taiyan: A General Language Model For Ancient Chinese

This project will be published soon.