Project

• Chinese Word Vectors   

This project provides 100+ Chinese Word Vectors (embeddings) trained with different representations (dense and sparse), context features (word, ngram, character, and more), and corpora. One can easily obtain pre-trained vectors with different properties and use them for downstream tasks.

Moreover, we provide a Chinese analogical reasoning dataset, CA8, and an evaluation toolkit that lets users evaluate the quality of their word vectors.
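
As a rough illustration of what analogical-reasoning evaluation involves, the sketch below answers questions of the form "a is to b as c is to ?" with the common vector-offset approach: pick the word whose vector is closest (by cosine) to b - a + c. This is not the project's evaluation toolkit; the function names, the toy two-dimensional vectors, and the single question are all invented for illustration, and real embeddings would be loaded from the downloaded vector files.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def solve_analogy(vectors, a, b, c):
    """Return the word closest to (b - a + c), excluding the three query words."""
    target = [y - x + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


def analogy_accuracy(vectors, questions):
    """Fraction of (a, b, c, d) questions for which the predicted word equals d."""
    correct = sum(solve_analogy(vectors, a, b, c) == d for a, b, c, d in questions)
    return correct / len(questions)


# Toy vectors and question, made up for this example (not taken from CA8).
toy_vectors = {
    "中国": [1.0, 0.0],
    "北京": [1.0, 1.0],
    "日本": [2.0, 0.0],
    "东京": [2.0, 1.0],
    "苹果": [0.0, -1.0],
}
toy_questions = [("中国", "北京", "日本", "东京")]

print(analogy_accuracy(toy_vectors, toy_questions))
```

With real vectors the candidate set is the whole vocabulary, so a practical implementation would normalise the embedding matrix once and use matrix operations rather than a Python loop.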

• Ancient Chinese Punctuation

This project is a tool for segmenting and punctuating ancient Chinese text. Ancient Chinese texts were written without segmentation or punctuation, and the flexibility of the language makes adding them a hard task, so people who study and collate ancient books have to spend a great deal of time on it. Trained on a large corpus (3.3 billion characters), the tool achieves strong performance (F1 > 0.9 in segmentation and F1 > 0.8 in punctuation) on a large test set covering various styles, e.g. prose, poetry, Song Ci, prescriptions, and Buddhist sutras. The tool can help researchers not only collate ancient books but also correct books that have already been collated.
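
To make the two reported scores concrete, here is a minimal sketch of how segmentation F1 and punctuation F1 could be computed from a gold and a predicted punctuated string. It assumes segmentation is scored on break positions only, while punctuation also requires the correct mark at each position; the project's actual scoring protocol is not stated here, and the helper names and the mark inventory are hypothetical.

```python
def extract_marks(text, marks="，。？！；：、"):
    """Return {(position, mark)}: each mark with the count of characters before it."""
    found = set()
    position = 0
    for ch in text:
        if ch in marks:
            found.add((position, ch))
        else:
            position += 1  # only non-punctuation characters advance the position
    return found


def f1_score(gold, pred):
    """F1 between two sets of items (harmonic mean of precision and recall)."""
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = extract_marks("学而时习之，不亦说乎？")
pred = extract_marks("学而时习之。不亦说乎？")  # right break, wrong mark after 之

# Segmentation: compare break positions only.
segmentation_f1 = f1_score({p for p, _ in gold}, {p for p, _ in pred})
# Punctuation: position and mark must both match.
punctuation_f1 = f1_score(gold, pred)

print(segmentation_f1, punctuation_f1)
```

Under this assumption punctuation is the stricter of the two tasks, since every correctly punctuated position is also a correct break but not the reverse.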

• AI Taiyan: A General Language Model For Ancient Chinese

AI Taiyan is a compact large language model focused on understanding ancient Chinese. In key ancient Chinese information processing tasks such as punctuation, identification of allusions, explanation of word meanings, and translation between ancient and modern Chinese, AI Taiyan shows a clear advantage over both general-purpose large models and domain-specific models, matching or surpassing human baseline performance. These results show that with appropriate model design, data processing, foundational training, and fine-tuning, satisfactory results can be achieved with only 1.8 billion parameters.