High-Performance Chinese Language Models Built on Quality Data and Advanced Engineering

01AI has introduced the Yi model family, a series of language and multimodal models with multidimensional capabilities. Based on 6B and 34B pretrained language models, the Yi models are extended to chat models, long-context models, depth-upscaled models, and vision-language models. The models are built on scalable supercomputing infrastructure and the classical transformer architecture, and are pretrained on 3.1 trillion tokens of English and Chinese corpora. The Yi models perform strongly on benchmarks like MMLU and achieve high human preference rates on platforms like AlpacaEval and Chatbot Arena, thanks to 01AI’s focus on data quality.

What is the Yi Model Family by 01AI?

The Yi model family, introduced by 01AI, is a series of language and multimodal models that demonstrate strong multidimensional capabilities. The family is based on 6B and 34B pretrained language models, which are then extended to chat models, 200K long-context models, depth-upscaled models, and vision-language models. The base models achieve strong performance on a wide range of benchmarks like MMLU, and the finetuned chat models deliver strong human preference rates on major evaluation platforms like AlpacaEval and Chatbot Arena.

The performance of Yi models is primarily attributed to their data quality, which is the result of 01AI’s data engineering efforts. For pretraining, 01AI constructs 3.1 trillion tokens of English and Chinese corpora using a cascaded data deduplication and quality filtering pipeline. For finetuning, a small-scale dataset of fewer than 10K instructions is polished over multiple iterations, with every single instance verified directly by machine learning engineers.
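To make the filtering side of such a pipeline concrete, here is a minimal sketch in Python. The stage names, thresholds, and ordering are illustrative assumptions, not 01AI’s actual (unreleased) pipeline settings.

```python
# Illustrative sketch of a cascaded quality-filtering pipeline.
# Stage order and thresholds are assumptions, not 01AI's actual values.
MAX_PERPLEXITY = 1_000.0       # hypothetical cutoff for an n-gram/LM scorer
MIN_ALPHA_RATIO = 0.6          # hypothetical heuristic: share of alphabetic chars
BLOCKLIST = {"lorem ipsum"}    # hypothetical safety/spam phrases

def heuristic_filter(doc: str) -> bool:
    """Cheap textual heuristics run first, before expensive model-based stages."""
    alpha = sum(c.isalpha() for c in doc)
    return len(doc) > 200 and alpha / max(len(doc), 1) >= MIN_ALPHA_RATIO

def safety_filter(doc: str) -> bool:
    """Drop documents containing blocklisted phrases."""
    lowered = doc.lower()
    return not any(phrase in lowered for phrase in BLOCKLIST)

def perplexity_filter(doc: str, score_fn) -> bool:
    """Drop documents a small language model finds implausible."""
    return score_fn(doc) <= MAX_PERPLEXITY

def cascade(docs, score_fn):
    """Apply filters from cheapest to most expensive; later stages see fewer docs."""
    for doc in docs:
        if heuristic_filter(doc) and safety_filter(doc) and perplexity_filter(doc, score_fn):
            yield doc
```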

For vision-language, the chat language model is combined with a vision transformer encoder, and the model is trained to align visual representations with the semantic space of the language model. The context length is further extended to 200K tokens through lightweight continual pretraining, demonstrating strong needle-in-a-haystack retrieval performance. The depth of the pretrained checkpoint is also extended and then continually pretrained, further improving performance.
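The alignment step described above typically amounts to projecting vision-transformer patch features into the language model’s embedding space. The sketch below shows that general pattern; the two-layer MLP projector and the dimensions (1024-d ViT features, 4096-d LM hidden size) are illustrative assumptions, not the exact Yi-VL configuration.

```python
# Sketch of aligning ViT image features with a language model's embedding space.
# The two-layer MLP projector and the dimensions are assumptions for illustration.
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Maps vision-transformer patch features into the LM's hidden size."""
    def __init__(self, vit_dim: int = 1024, lm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vit_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vit_dim)
        return self.proj(patch_features)

# The projected visual tokens are concatenated with text embeddings and fed to
# the chat language model, which is trained to ground its answers in them.
projector = VisionProjector()
fake_patches = torch.randn(2, 256, 1024)   # stand-in for ViT outputs
visual_tokens = projector(fake_patches)    # shape: (2, 256, 4096)
```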

How Does the Yi Model Family Work?

The Yi model family works by building upon scalable supercomputing infrastructure and the classical transformer architecture. The models are pretrained from scratch on 3.1T tokens of highly engineered data and finetuned on a small but meticulously polished alignment dataset. The data quality, resulting from substantial engineering efforts, allows Yi to approach GPT-3.5 in both benchmark scores and human preference.

In designing the Yi model series, three dimensions are considered: model scale, data scale, and data quality. The model scale is chosen to be small enough for inference on consumer-grade hardware like the RTX 4090, yet large enough to exhibit complex reasoning and emergent abilities. The pretraining data scale is increased to 3.1T tokens to compensate for the reduced compute FLOPs of a smaller model. The data engineering principle is to promote quality over quantity for both pretraining and finetuning.
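A rough back-of-envelope calculation shows why 34B parameters is a sensible ceiling for single-GPU inference on a 24 GB card such as the RTX 4090. The quantization levels and the neglect of KV-cache and activation overhead are simplifying assumptions.

```python
# Back-of-envelope check that a 34B-parameter model can fit on a 24 GB card
# such as an RTX 4090 once quantized. Overheads (KV cache, activations) ignored.
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1024**3

for bits in (16, 8, 4):
    print(f"34B at {bits:>2}-bit: {weight_memory_gb(34, bits):5.1f} GB")

# 34B at 16-bit:  63.3 GB  -> needs multiple GPUs
# 34B at  8-bit:  31.7 GB  -> still over 24 GB
# 34B at  4-bit:  15.8 GB  -> fits on a single RTX 4090, leaving room for the KV cache
```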

What Makes the Yi Model Family Unique?

The uniqueness of the Yi model family lies in its data cleaning system, which features a sophisticated filtering pipeline based on language, heuristic textual features, perplexity, semantics, topic, and safety, as well as a cascaded deduplication process based on paragraph-level MinHash and exact matching. This thorough pipeline leads to a much higher removal ratio than existing pipelines, which is key to the success of the data engineering.
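A minimal sketch of what such cascaded deduplication can look like is shown below, assuming the third-party datasketch package and an illustrative Jaccard threshold; this is a generic pattern, not 01AI’s released tooling.

```python
# Sketch of cascaded deduplication: exact matching on normalized paragraphs first,
# then MinHash-LSH for near-duplicates. Threshold and num_perm are assumptions.
# Uses the third-party `datasketch` package (pip install datasketch).
from datasketch import MinHash, MinHashLSH

def minhash(paragraph: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for token in paragraph.lower().split():
        m.update(token.encode("utf-8"))
    return m

def dedup(paragraphs):
    seen_exact = set()
    lsh = MinHashLSH(threshold=0.8, num_perm=128)   # Jaccard cutoff is illustrative
    kept = []
    for i, p in enumerate(paragraphs):
        norm = " ".join(p.lower().split())
        if norm in seen_exact:          # stage 1: exact matching
            continue
        m = minhash(p)
        if lsh.query(m):                # stage 2: near-duplicate via MinHash-LSH
            continue
        seen_exact.add(norm)
        lsh.insert(str(i), m)
        kept.append(p)
    return kept
```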

The underlying principle is that although pretraining requires data scaling, it is more important to ensure the data are of high quality than to train the model on large volumes of raw data. Regarding the model architecture, Yi uses a standard implementation of the Transformer with Grouped-Query Attention (GQA), SwiGLU activation, and Rotary Position Embedding with an adjusted base frequency (RoPE ABF).
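The sketch below shows plain RoPE with a configurable base, which is the knob that ABF adjusts: raising the base slows the rotation of the low-frequency dimensions so positional phases wrap around later, which helps long contexts. The base values used here are illustrative rather than Yi’s exact settings.

```python
# Minimal sketch of rotary position embeddings (RoPE) where the base frequency
# can be adjusted (ABF) to stretch the usable context; base values are illustrative.
import torch

def rope_frequencies(head_dim: int, base: float) -> torch.Tensor:
    """Per-dimension inverse frequencies; a larger base lowers the low frequencies,
    delaying phase wrap-around at long positions."""
    exponents = torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim
    return 1.0 / (base ** exponents)

def apply_rope(x: torch.Tensor, base: float = 10_000.0) -> torch.Tensor:
    """x: (seq_len, head_dim) -- rotate pairs of dimensions by position-dependent angles."""
    seq_len, head_dim = x.shape
    inv_freq = rope_frequencies(head_dim, base)
    angles = torch.outer(torch.arange(seq_len, dtype=torch.float32), inv_freq)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    rotated = torch.empty_like(x)
    rotated[:, 0::2] = x1 * cos - x2 * sin
    rotated[:, 1::2] = x1 * sin + x2 * cos
    return rotated

short_ctx = apply_rope(torch.randn(4096, 128), base=10_000.0)
long_ctx = apply_rope(torch.randn(4096, 128), base=5_000_000.0)  # adjusted base (assumed value)
```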

How Does the Yi Model Family Perform?

The Yi model family performs strongly across this spectrum: the base models score highly on knowledge and reasoning benchmarks such as MMLU, while the finetuned chat models earn strong human preference rates on AlpacaEval and Chatbot Arena. As discussed above, this performance is primarily attributed to data quality.

Beyond the base and chat models, lightweight continual pretraining extends the context window to 200K tokens with strong needle-in-a-haystack retrieval performance, and depth upscaling of the pretrained checkpoint followed by further continual pretraining improves results again.
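Needle-in-a-haystack evaluations are typically built by burying a single fact at varying depths of a long filler document and checking whether the model can retrieve it. The sketch below follows that recipe; the `generate` callable, the filler text, and the depth grid are placeholders, not 01AI’s evaluation harness.

```python
# Sketch of a needle-in-a-haystack retrieval check: bury one fact at varying
# depths inside long filler text and ask the model to recall it. The `generate`
# callable is a placeholder for whatever inference API is in use.
NEEDLE = "The secret passphrase is BLUE-HARBOR-42."
QUESTION = "What is the secret passphrase?"
FILLER = "The sky was clear and the market was quiet that morning. " * 2000

def build_prompt(depth_fraction: float) -> str:
    cut = int(len(FILLER) * depth_fraction)
    return FILLER[:cut] + NEEDLE + FILLER[cut:] + "\n\n" + QUESTION

def run_check(generate, depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    results = {}
    for d in depths:
        answer = generate(build_prompt(d))   # e.g. a call to a 200K-context chat model
        results[d] = "BLUE-HARBOR-42" in answer
    return results
```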

What is the Future of the Yi Model Family?

Given the current results, 01AI believes that continuing to scale up model parameters using thoroughly optimized data will lead to even stronger frontier models. The Yi model family is a step towards the vision of making large language models the next generation computational platform and empowering the whole community with significantly amplified intelligence.

The data engineering principle of promoting quality over quantity for both pretraining and finetuning is expected to continue. Pretraining data quality is guaranteed by a sophisticated data cleaning pipeline with cascaded filtering methods and intentionally increased deduplication strength. For finetuning data, the emphasis is on quality, handcrafting fewer than 10K instructions over multiple iterations based on user feedback.

Publication details: “Yi: Open Foundation Models by 01.AI”
Publication Date: 2024-03-07
Authors: 01.AI, Alexander S. Young, Bei Chen, et al.
Source: arXiv (Cornell University)
DOI: https://doi.org/10.48550/arxiv.2403.04652

