The pursuit of effective text embeddings, which translate words and sentences into meaningful numerical representations, drives advances in numerous artificial intelligence applications. Ziyin Zhang from Ant Group and Shanghai Jiao Tong University, alongside Zihan Liao and Hang Yu, now present F2LLM, a new family of embedding models achieving state-of-the-art performance with a surprisingly small footprint. Unlike existing top-performing models that demand extensive and expensive training, F2LLM learns directly from a carefully curated collection of six million publicly available text examples. This approach yields models including a 4 billion parameter version that ranks second in its size class on a leading embedding benchmark and a 1.7 billion parameter model that currently leads its size category, offering a strong, reproducible, and cost-effective baseline for future research in the field.
The study compares F2LLM 0.6B, 1.7B, and 4B against embedding models of various sizes, including Qwen3-Embedding 0.6B, 4B, and 8B, Gemini Embedding, NV-Embed-v2, Linq-Embed-Mistral, and LGAI-Embedding. Trained solely on open-source, non-synthetic data, F2LLM achieves a strong balance between embedding performance and model size, highlighting the relationship between model size, training data, and resulting performance.
The research team developed F2LLM, a suite of text embedding models available in three sizes (0.6 billion, 1.7 billion, and 4 billion parameters) designed to represent the meaning of text efficiently. Unlike many leading embedding models, which require extensive pre-training on large synthetic datasets, F2LLM achieves strong performance through direct fine-tuning on a carefully curated dataset of 6 million query-document-negative tuples sourced from openly available resources. This approach reduces training costs while maintaining high embedding quality, prioritizing practical application by balancing model size, training expense, and performance.
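The summary does not spell out F2LLM's exact training objective, but fine-tuning on query-document-negative tuples is commonly implemented with an InfoNCE-style contrastive loss over a positive document and mined hard negatives. The sketch below illustrates that common formulation only as an assumption; the tensor shapes and temperature value are illustrative, not taken from the report.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb, pos_emb, neg_emb, temperature=0.05):
    """InfoNCE-style loss over (query, positive document, hard negatives) tuples.

    query_emb: (B, D)    query embeddings
    pos_emb:   (B, D)    embeddings of the positive documents
    neg_emb:   (B, N, D) embeddings of N hard negatives per query
    """
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    n = F.normalize(neg_emb, dim=-1)

    pos_scores = (q * p).sum(dim=-1, keepdim=True)           # (B, 1)
    neg_scores = torch.einsum("bd,bnd->bn", q, n)             # (B, N)
    logits = torch.cat([pos_scores, neg_scores], dim=1) / temperature
    labels = torch.zeros(logits.size(0), dtype=torch.long)    # the positive sits at index 0
    return F.cross_entropy(logits, labels)

# Toy example: batch of 4 queries, 768-dim embeddings, 7 hard negatives each.
loss = info_nce_loss(torch.randn(4, 768), torch.randn(4, 768), torch.randn(4, 7, 768))
print(loss.item())
```

In a setup like this, each of the 6 million curated tuples would supply one query, its positive document, and several hard negatives per training step.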
On the MTEB English leaderboard, the 4 billion parameter version, F2LLM-4B, ranks 2nd among models with approximately 4 billion parameters and 7th overall, while the 1.7 billion parameter model, F2LLM-1.7B, takes the top position in the 1 to 2 billion parameter range. To promote reproducibility and further research, the team released the training dataset and associated code alongside the models, establishing a strong, reproducible baseline for future advancements. This open-access approach enables other researchers to validate the findings, build upon the work, and explore new applications of F2LLM, offering a cost-effective and accessible pathway for developing high-quality text embeddings.
F2LLM Achieves Top-Ranking Embedding Performance
The research team introduces F2LLM, a suite of embedding models available in three sizes (0.6 billion, 1.7 billion, and 4 billion parameters) that demonstrate a strong balance between model size, training cost, and performance. Unlike existing top-performing embedding models, which require extensive contrastive pretraining and synthetic data, F2LLM is finetuned directly from foundation models on 6 million query-document-negative tuples sourced from publicly available, non-synthetic datasets. On the MTEB English leaderboard, F2LLM-4B ranks 2nd among models with approximately 4 billion parameters and 7th overall, F2LLM-1.7B ranks 1st among models in the 1 to 2 billion parameter range, and F2LLM-0.6B secures 2nd place among models with fewer than 1 billion parameters. F2LLM is particularly strong on clustering tasks, where the 4 billion parameter model achieves a score of 68.54, a new state-of-the-art result. To facilitate further research, the model checkpoints, training dataset, and training code are released, positioning F2LLM as a reproducible and budget-friendly baseline for future work in text embedding.
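The report describes F2LLM as finetuned directly from decoder-only foundation models. A common way to turn such a model into an encoder is to pool the hidden state of each sequence's final non-padding token; the sketch below assumes that pooling strategy and uses a small public checkpoint purely as a placeholder, since this summary names neither the released model IDs nor the pooling method.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Placeholder backbone (requires a recent transformers release);
# substitute a released F2LLM checkpoint once its hub ID is known.
MODEL_ID = "Qwen/Qwen3-0.6B"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"  # keeps the last-token index computation below valid
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()

def embed(texts):
    """Encode texts with last-token pooling, one common choice for decoder-only embedders."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state               # (B, T, D)
    last = batch["attention_mask"].sum(dim=1) - 1                # index of last real token
    emb = hidden[torch.arange(hidden.size(0)), last]             # (B, D)
    return F.normalize(emb, dim=-1)

vectors = embed(["What is a text embedding?", "A numerical representation of text."])
print(vectors @ vectors.T)  # cosine similarities between the two sentences
```

Mean pooling over non-padding tokens is another common alternative; the strategy F2LLM actually uses is documented in the released code rather than in this summary.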
F2LLM Achieves Leading Embedding Performance with Finetuning
The team presents F2LLM, a family of text embedding models available in three sizes, achieving a strong balance between model size, training data requirements, and embedding performance. Unlike many leading embedding models that demand extensive pre-training and synthetic data, F2LLM is finetuned directly from foundation models using openly available datasets. On the MTEB English leaderboard, F2LLM-4B ranks second among models with approximately four billion parameters, while the 1.7B version leads models in the one to two billion parameter range. To encourage further research, the model checkpoints, training dataset, and associated code are released, establishing F2LLM as a reproducible and cost-effective baseline for future work in text embedding.
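Because the checkpoints, dataset, and code are public, the leaderboard numbers can in principle be checked with the open-source mteb evaluation harness. The sketch below is a minimal smoke test under the assumption that a released checkpoint loads as a SentenceTransformer model; the checkpoint path is a placeholder, and the mteb API may vary slightly between versions.

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Placeholder: substitute the released F2LLM checkpoint ID or a local path.
model = SentenceTransformer("path/to/F2LLM-checkpoint")

# Evaluate on two small English tasks as a sanity check; the full leaderboard covers far more.
evaluation = MTEB(tasks=["Banking77Classification", "STSBenchmark"])
results = evaluation.run(model, output_folder="results/f2llm")
print(results)
```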
👉 More information
🗞 F2LLM Technical Report: Matching SOTA Embedding Performance with 6 Million Open-Source Data
🧠 ArXiv: https://arxiv.org/abs/2510.02294
