• Chinese Word Vectors
This project provides 100+ sets of Chinese word vectors (embeddings) trained with different representations (dense and sparse), context features (word, n-gram, character, and more), and corpora. These pre-trained word vectors can be easily integrated into machine learning models for tasks such as text classification, sentiment analysis, and named entity recognition.
Moreover, we provide CA8, a Chinese analogical reasoning dataset, along with an evaluation toolkit that lets users assess the quality of their word vectors.
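Analogy benchmarks like CA8 are typically scored with the 3CosAdd rule: for "a is to b as c is to ?", find the word whose vector is nearest to b − a + c. A minimal sketch of that rule, using a tiny invented embedding table (the words and 4-dimensional vectors below are illustrative assumptions, not vectors from this project):

```python
import numpy as np

# Toy embedding table standing in for a pre-trained vector set.
# These words and 4-d vectors are invented for illustration only.
toy_vectors = {
    "国王": np.array([0.9, 0.1, 0.8, 0.1]),  # king
    "男人": np.array([0.8, 0.1, 0.1, 0.1]),  # man
    "女人": np.array([0.1, 0.9, 0.1, 0.1]),  # woman
    "王后": np.array([0.2, 0.9, 0.8, 0.1]),  # queen
    "苹果": np.array([0.1, 0.1, 0.1, 0.9]),  # apple (distractor)
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' with the 3CosAdd rule:
    return the word whose vector is closest to b - a + c,
    excluding the three query words themselves."""
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# 男人:国王 :: 女人:?  →  王后 with these toy vectors
print(analogy("男人", "国王", "女人", toy_vectors))
```

A real evaluation would loop this over every CA8 question and report accuracy; the sketch only shows the per-question scoring step.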
• DiscoCC: Diachronic Semantic Corpus of Classical Chinese
DiscoCC is designed to trace lexical semantic change and cultural evolution in Classical Chinese across nearly 3,000 years. We applied state-of-the-art language modeling and embedding-based techniques to annotate, align, and cluster word senses within a large-scale historical corpus. The result is a diachronic semantic resource encompassing nearly 200 million characters from texts spanning the pre-Qin era to late Qing and Republican China, covering over 65,000 word forms and 126,000 sense entries. It offers powerful tools for sense-level corpus search and tracking semantic evolution, providing crucial support for diachronic research in language and culture.
• AI Taiyan: Classical Chinese Large Language Model
AI Taiyan is a large language model for Classical Chinese, trained from scratch for this vertical domain. With just 1.8B parameters, it excels at challenging tasks such as interpretation, translation, and allusion analysis, delivering results on par with, or surpassing, the performance of graduate students in the field. It is used in dozens of countries to support learning, teaching, and research in Classical Chinese, and has received extensive media coverage and high praise.
• Classical Chinese Punctuation
Classical Chinese manuscripts lack spaces and punctuation, making them challenging and time-consuming to interpret.
We developed a multi-task system for punctuation and named entity recognition, which won first place in the CCL 2020 Classical Chinese NER Competition. Trained on a dataset of 3.3 billion characters, it achieves expert-level performance, with sentence-segmentation F1 over 0.95 and punctuation F1 over 0.9 across diverse texts. Widely adopted by research institutions and publishers, this tool significantly reduces manual effort, accelerating both the study and accessibility of Classical Chinese literature.
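Punctuation restoration of this kind is commonly framed as character-level sequence labeling: a tagger predicts, for each character, which mark (if any) should follow it, and the marks are then re-inserted. A minimal sketch of that decoding step, with an invented label set and hard-coded "predictions" standing in for the trained model's output:

```python
# Illustrative label inventory; the real system's label set may differ.
PUNCT_LABELS = {"O": "", "COMMA": "，", "PERIOD": "。"}

def restore_punctuation(chars, labels):
    """Re-insert punctuation into an unpunctuated character sequence.
    Each label says which mark (if any) follows the matching character."""
    assert len(chars) == len(labels), "one label per character"
    out = []
    for ch, lab in zip(chars, labels):
        out.append(ch)
        out.append(PUNCT_LABELS[lab])
    return "".join(out)

# In the real system the labels come from the trained tagger; here we
# hard-code a plausible prediction for a short Analects passage.
chars = "学而时习之不亦说乎"
labels = ["O", "O", "O", "O", "COMMA", "O", "O", "O", "PERIOD"]
print(restore_punctuation(chars, labels))  # → 学而时习之，不亦说乎。
```

Sentence segmentation falls out of the same labeling: any character tagged with a sentence-final mark ends a sentence.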