
Tencent Releases Conan-Embedding-V2


Introduction

Embedding models are a crucial component for retrieval and recall in Retrieval-Augmented Generation (RAG). Our team released the Chinese embedding model Conan-Embedding-V1 at the end of August 2024, achieving state-of-the-art (SOTA) performance on the CMTEB leaderboard and open-sourcing it on Hugging Face. Conan-Embedding-V1 has gained widespread attention in the open-source community, helping practitioners in fields such as search, recommendation, and RAG. Recently, our team continued its exploration of the embedding domain and released Conan-Embedding-V2. It is built on Conan-1.4B, a large language model backbone that we trained from scratch, and achieves both Chinese and English SOTA performance on MTEB, surpassing larger models from NVIDIA, Qwen, and others.


Background

For relevant background on embedding models, please refer to our V1 introduction. With the breakthroughs of DeepSeek and Manus, the application scenarios of embeddings in RAG have become increasingly clear and important. In V1, we trained the embedding task on top of a general pre-trained bidirectional BERT model. In V2, we trained a large language model backbone, Conan-1.4B, from scratch with our own vocabulary and model architecture. Building on it, we trained Chinese, English, and multilingual embedding tasks, achieving Chinese and English SOTA on the MTEB leaderboard, adding multilingual support, and becoming one of the first models to support Chinese-English cross-lingual retrieval.

[Figure 1] Schematic diagram of the performance/parameter size of Conan-Embedding-v2 and mainstream Embedding models.

MTEB Leaderboard Results

English Results
[MTEB English leaderboard results]

Chinese Results
[MTEB Chinese leaderboard results]

New Features

  • Language Support: Chinese SOTA → Chinese & English SOTA, plus multilingual capabilities
  • Cross-lingual Retrieval: Chinese ↔ English mutual retrieval
  • Context Length: 512 → 32k
  • Backbone: pre-trained BERT model → Conan-1.4B large language model backbone pre-trained from scratch


Main Methods

[Figure 2] Framework Schematic Diagram: Large Language Model (LLM) Pre-training, LLM Supervised Fine-tuning (SFT), Embedding Weakly Supervised Training, and Embedding Supervised Training.

The Conan-embedding-v2 training process is divided into four stages, with each stage differing in data format and loss function. In the Large Language Model (LLM) training stages (Stages 1 and 2), we incorporated embedding data to better align the LLM with the embedding task. In the weakly supervised training stage, we used the same paired data as in LLM Supervised Fine-tuning (SFT) and applied a soft mask to bridge the gap between the LLM and the embedding model. In the supervised training stage, benefiting from LLM training, we introduced cross-lingual retrieval datasets and a dynamic hard negative mining method to enhance the diversity and value of the data.

LLM Training

We designed Conan-1.4B with 8 attention layers, a hidden size of 3584, and a maximum context length of 32k. With 1.4 billion parameters, it provides a large embedding dimension (3584) at a relatively small parameter count.
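
For reference, the stated hyperparameters can be collected into a plain configuration sketch; anything the post does not specify (number of attention heads, feed-forward size, and so on) is omitted rather than guessed.

```python
# Architecture hyperparameters of Conan-1.4B as stated above; values not
# mentioned in the post (attention heads, FFN size, etc.) are intentionally omitted.
conan_1_4b_config = {
    "num_hidden_layers": 8,             # 8 attention layers
    "hidden_size": 3584,                # also the output embedding dimension
    "max_position_embeddings": 32_768,  # 32k maximum context length
    "vocab_size": 150_000,              # BPE vocabulary trained from scratch (see below)
}
```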

We trained Conan’s Byte-Pair Encoding (BPE) tokenizer from scratch, starting from basic letters and symbols, on approximately 400,000 multilingual samples, with a target vocabulary size of 150,000. We then evaluated the encoding efficiency of the self-trained tokenizer on Chinese and English corpora; compared to Qwen’s tokenizer, Conan’s tokenizer performs well.

[Table 1] Comparison of Conan Tokenizer Encoding Efficiency
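
As an illustration, a minimal sketch of training a byte-level BPE tokenizer from scratch with the Hugging Face tokenizers library is shown below; the corpus path and special tokens are hypothetical, and the actual Conan training setup is more involved than this.

```python
# Minimal BPE tokenizer training sketch (Hugging Face `tokenizers`), assuming a
# local multilingual corpus file; the file path and special tokens are hypothetical.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=150_000,                           # target vocabulary size from the post
    special_tokens=["<|pad|>", "<|endoftext|>"],  # hypothetical special tokens
)
tokenizer.train(files=["multilingual_corpus.txt"], trainer=trainer)
tokenizer.save("conan_bpe_tokenizer.json")

# Encoding efficiency can then be compared as tokens produced per character
# on held-out Chinese and English corpora.
```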

As shown in Figure 2, we first pre-trained the model on approximately 3 trillion tokens of general data, with a particular focus on adding data suited to pair-style tasks. We adopted the standard data filtering methods described in InternLM2.

Subsequently, we collected approximately 600 million supervised fine-tuning (SFT) data samples, organized in the form of paired data (Query - Positive Sample), with the format of instruction, input, and output.

Embedding Training

Weakly Supervised Training

In embedding training, we first implemented weakly supervised training to enable the model to initially learn embedding representations. In this stage, we used the same data as in LLM supervised fine-tuning but with different data formats and loss functions. Specifically, we treated the instruction and input as the query and the output as the positive passage.

To ensure higher data quality, we used the gte-Qwen2-7B-instruct model for scoring and discarded data with scores below 0.4.
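
A hedged sketch of this filtering step is shown below: each (query, passage) pair is scored and kept only if the score clears the 0.4 threshold. The model repository id and the use of plain cosine similarity as the score are assumptions; the post does not describe the exact scoring setup.

```python
# Quality-filtering sketch: score (query, passage) pairs with a scoring model and
# drop pairs below 0.4. The repo id and cosine-similarity scoring are assumptions.
from sentence_transformers import SentenceTransformer

scorer = SentenceTransformer("Alibaba-NLP/gte-Qwen2-7B-instruct", trust_remote_code=True)

def keep_pair(query: str, passage: str, threshold: float = 0.4) -> bool:
    q_emb, p_emb = scorer.encode([query, passage], normalize_embeddings=True)
    score = float(q_emb @ p_emb)   # cosine similarity of L2-normalized embeddings
    return score >= threshold
```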

To efficiently and effectively utilize the paired data, we employed the InfoNCE loss function during training, combined with In-Batch Negative sampling. The formula is as follows:

\( \mathcal{L}_{\text{InfoNCE}} = -\sum_{i} \log \frac{e^{\,\mathrm{sim}(x_i,\, y^+_i)/\tau}}{\sum_{j} e^{\,\mathrm{sim}(x_i,\, y_j)/\tau}} \)

Where

  • \(x_i\) represents the query of the \(i\)-th sample
  • \(y^+_i\) represents its positive passage
  • \(y_j\) represents the passages of the other samples in the batch, which are treated as negatives
  • \(\tau\) is the temperature scaling parameter.
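
To make this concrete, here is a minimal PyTorch sketch of InfoNCE with in-batch negatives; the temperature value and the use of cosine similarity over L2-normalized embeddings are assumptions, not details from the post.

```python
import torch
import torch.nn.functional as F

def info_nce_in_batch(query_emb: torch.Tensor, passage_emb: torch.Tensor,
                      temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE with in-batch negatives.

    query_emb:   (N, d) embeddings of the queries x_i
    passage_emb: (N, d) embeddings of the positive passages y_i^+;
                 row j != i serves as a negative for query i.
    """
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(passage_emb, dim=-1)
    logits = q @ p.T / temperature                     # (N, N) scaled cosine similarities
    labels = torch.arange(q.size(0), device=q.device)  # positives lie on the diagonal
    return F.cross_entropy(logits, labels)
```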

Supervised Training

After weakly supervised training, we performed task-specific fine-tuning for different downstream tasks. As shown in Figure 2, we categorized the tasks into four types: Retrieval, Cross-lingual Retrieval, Classification, and Semantic Textual Similarity (STS). The first three types of tasks consist of a query, a positive text, and some negative texts, and use the classic InfoNCE loss function. The STS task involves distinguishing the degree of similarity between two texts; its conventional loss function is cross-entropy, but we adopted the CoSENT loss to optimize it, with the following formula:

\( \mathcal{L}_{\text{CoSENT}} = \log\left(1 + \sum_{\mathrm{sim}(i,j) > \mathrm{sim}(k,l)} e^{\left(\cos(x_k,\, x_l) - \cos(x_i,\, x_j)\right)/T}\right) \)

Where

\( \mathrm{sim}(i, j) > \mathrm{sim}(k, l) \) means that the ground truth similarity of the pair \( (x_i, x_j) \) is greater than that of \( (x_k, x_l) \); the sum runs over all such pairs of pairs.

\( \cos(x_k, x_l) \) denotes the cosine similarity between \(x_k\) and \(x_l\), and \(T\) is the temperature scaling parameter.
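
For illustration, a minimal PyTorch sketch of this CoSENT loss follows; the temperature value is an assumption.

```python
import torch

def cosent_loss(cos_sim: torch.Tensor, labels: torch.Tensor,
                temperature: float = 0.05) -> torch.Tensor:
    """CoSENT loss over a batch of sentence pairs.

    cos_sim: (N,) predicted cosine similarities for N sentence pairs
    labels:  (N,) ground-truth similarity scores
    For every (i, j) with labels[i] > labels[j], the loss pushes
    cos_sim[i] above cos_sim[j].
    """
    s = cos_sim / temperature
    order = labels[:, None] > labels[None, :]   # order[i, j]: pair i should score higher than pair j
    diffs = (s[None, :] - s[:, None])[order]    # s_j - s_i for every such (i, j)
    zero = torch.zeros(1, device=diffs.device, dtype=diffs.dtype)
    # log(1 + sum(exp(diffs))) computed stably via logsumexp with an extra zero logit
    return torch.logsumexp(torch.cat([zero, diffs]), dim=0)
```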

Main Strategies

SoftMask

During the LLM training phase, a causal mask is used to ensure that the current token cannot access subsequent tokens, which is suitable for token-level language modeling. However, embedding training requires a holistic understanding of sentences and uses a bidirectional mask for vector-level modeling. There are several key gaps between these two types of masks.

First, since the upper triangular part of the causal mask is all zeros, the attention weights in this region are not used during forward propagation. When directly switching to a bidirectional mask, these weights need to undergo a learning process to become effective. Second, the causal mask is full-rank and has stronger expressive power, while the rank of the bidirectional mask is always 1. If we directly switch from a causal mask to a bidirectional mask during the weakly supervised fine-tuning stage, training may converge quickly due to the low rank but is prone to getting stuck in local optima, making further optimization difficult.

As shown in Figure 2, to address these gaps, we introduce a novel soft mask mechanism. To solve the attention weight issue, we introduce a term \( \alpha(t) \) into the soft mask, where \( \alpha(t) \) is a scheduling function that gradually transitions the mask from 0 to 1, allowing the model to progressively update these parameters; \(t\) is the current training step and \(T\), the total number of steps, is used for normalization, so \( \alpha(t) \) increases monotonically from 0 at the start of training to 1 at step \(T\).
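
A minimal sketch of the mechanism is shown below. The post does not reproduce the exact form of \( \alpha(t) \), so the linear schedule used here is an assumption.

```python
import torch

def soft_mask(seq_len: int, step: int, total_steps: int) -> torch.Tensor:
    """Blend a causal mask into a bidirectional mask during weakly supervised training.

    alpha ramps from 0 to 1 over training; a linear schedule is assumed here,
    since the post only states that alpha(t) transitions the mask from 0 to 1
    with T (total_steps) used for normalization.
    """
    alpha = min(step / total_steps, 1.0)
    causal = torch.tril(torch.ones(seq_len, seq_len))  # lower triangle: always attendable
    bidirectional = torch.ones(seq_len, seq_len)
    # the upper-triangular attention weights are switched on gradually instead of all at once
    return causal + alpha * (bidirectional - causal)
```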

[Figure 3] Comparison of Loss Curves Before and After Using the Soft Mask Mechanism

As shown in Figure 3, we plotted the loss curves with and without the soft mask mechanism. The results indicate that in the initial stages, the loss decreases more slowly with the soft mask than without it. However, the final loss achieved with the soft mask is lower. This suggests that the soft mask method enables the model to learn more comprehensive feature representations in the early stages of training.

As training progresses, the rank is reduced to retain the most important features. This rank reduction process acts as a regularization technique, helping to prevent overfitting.

Cross-lingual Retrieval Dataset

To make Conan-embedding-v2 truly multilingual, our goal is for it to learn representations across different languages. Previous research has primarily focused on fine-tuning directly on multilingual corpora or on parallel corpora (where texts are translations of each other), but often overlooks the inherent connections between languages. To address this, we propose a Cross-lingual Retrieval (CLR) dataset that integrates representations from different languages through cross-lingual search, thereby bridging the representation gap between them.

We started with existing retrieval datasets and expanded them to support cross-lingual retrieval. To reduce the workload, we used Qwen2.5-7B to translate only the query portion of each dataset. For example, we translated queries from a subset of MSMARCO (an English retrieval task) into Chinese to enable Chinese-to-English retrieval. Applying the same method to other tasks, we translated queries to support cross-lingual retrieval across 26 languages, ultimately generating approximately 10 million data pairs.
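
Below is a hedged sketch of this construction step: only the query side of an existing retrieval dataset is machine-translated, while the passages stay in their original language. The model repository id, the prompt, and the generation settings are assumptions.

```python
# Cross-lingual pair construction sketch: translate only the query side.
# Model repo id, prompt wording, and generation settings are assumptions.
from transformers import pipeline

translator = pipeline("text-generation", model="Qwen/Qwen2.5-7B-Instruct")

def translate_query(query: str, target_lang: str = "Chinese") -> str:
    prompt = (f"Translate the following search query into {target_lang}. "
              f"Output only the translation.\n\n{query}")
    out = translator(prompt, max_new_tokens=128, do_sample=False, return_full_text=False)
    return out[0]["generated_text"].strip()

# Each (translated_query, original_passage) pair then becomes one
# cross-lingual retrieval training example.
```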

[Figure 4] Comparison of Embedding Distribution Before and After Training on the Cross-lingual Retrieval Dataset

To visualize the effect more intuitively, we conducted a comparative analysis of embedding distributions. We used the Multilingual Amazon Reviews Corpus, which is not included in our cross-lingual retrieval dataset; it contains reviews in English, Japanese, German, French, Chinese, and Spanish. For each language, we sampled 1000 sentences from the test set. As shown in Figure 4, the “vanilla” model is our model trained without the CLR dataset: its embeddings of the six languages are clearly clustered, with each language occupying a separate region of the distribution space. In contrast, Conan-embedding-v2 integrates the embeddings of all languages into a unified distribution, demonstrating its effectiveness in creating more consistent multilingual representations.
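
The comparison can be reproduced in spirit with a short script like the following; the model repository id, the tiny sample sentences, and the choice of t-SNE for the 2-D projection are assumptions made for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sentence_transformers import SentenceTransformer

# Hypothetical repo id; in practice, sample ~1000 test-set reviews per language
# from the Multilingual Amazon Reviews Corpus as described above.
model = SentenceTransformer("TencentBAC/Conan-embedding-v2")
samples = {
    "en": ["great battery life", "arrived broken"],
    "zh": ["电池续航很好", "收到时已经损坏"],
    "de": ["tolle Akkulaufzeit", "kam kaputt an"],
}

embeddings, labels = [], []
for lang, sentences in samples.items():
    embeddings.append(model.encode(sentences))
    labels.extend([lang] * len(sentences))

points = TSNE(n_components=2, perplexity=2).fit_transform(np.vstack(embeddings))
for lang in samples:
    idx = [i for i, l in enumerate(labels) if l == lang]
    plt.scatter(points[idx, 0], points[idx, 1], label=lang)
plt.legend()
plt.show()
```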

Dynamic Hard Negative Mining

For a detailed introduction to this strategy, please refer to our Conan-V1 technical report: Conan-Embedding-V1.

Data

To achieve the multilingual capabilities of Conan-embedding-v2, we collected a large amount of diverse data for weakly supervised pre-training and embedding fine-tuning.

For weakly supervised pre-training, we primarily collected title-content pairs from news articles and websites, systematically removing low-quality samples, redundant duplicate content, and potentially harmful information. For supervised training, we organized datasets for five different tasks: Retrieval, Re-ranking, Classification, Clustering, and Semantic Textual Similarity (STS).

[Table 2] Data usage status

Experimental Results

Main Results

[Table 3] MTEB Chinese and English Results
Table 3 details the performance of our method on the MTEB-English and MTEB-Chinese benchmarks. Conan-embedding-v2 achieves SOTA results on both the English and Chinese MTEB tests, with particularly strong performance on classification (CLS) tasks (English 91.11, Chinese 76.8) and reranking tasks (English 51.49, Chinese 73.69) thanks to the various training strategies.

Ablation Study Results

[Table 4] Ablation Study Results
Table 4 systematically evaluates the contribution of each part of the framework through ablation experiments. These results verify the synergistic effect of Conan-embedding-v2’s components in enhancing the model’s overall capabilities.

Using only the cross-lingual retrieval (CLR) task objective (row 2) improves multilingual performance to 62.69% (a 1.96% increase compared to using the soft mask (SM) alone) while keeping monolingual scores stable, demonstrating its targeted ability to optimize cross-lingual representations.

Using only dynamic hard negative mining (DHNM, row 3) achieves the best monolingual results among the single components (English 71.50% / Chinese 72.09%), validating its effectiveness in distinguishing fine-grained semantic boundaries through adaptive negative sampling.

The combination of SM+CLR (row 4) significantly boosts multilingual performance to 64.45% (a 3.56% increase compared to using SM alone).

The combination of SM+DHNM (row 5) reaches peak monolingual scores before full integration. However, these two partial combinations reveal a precision trade-off between multilingual and monolingual tasks.

Our complete framework including all components (last row) resolves this trade-off by synergistically combining the initialization stability of SM, the cross-lingual alignment of CLR, and the discriminative training of DHNM, achieving SOTA performance across all tasks.

Comparison with Mainstream Models

[Table 5] Comparison with Mainstream Models
In Table 5, we compare several representative models with ours. Conan-embedding-v2 achieves SOTA with 1503 million parameters (approximately 1.4B) and an embedding dimension of 3584, striking a good balance among model size, output dimension, inference time, and accuracy. Support for Matryoshka Representation Learning (MRL) and the larger output dimension also open up more possibilities for real-world applications.

Conclusion and Outlook

This article introduces Conan-embedding-v2, detailing the entire process from LLM model design and vocabulary training through pre-training to embedding training, and addressing the data and training gaps between LLMs and embedding models. By leveraging paired data from LLM training, SoftMask for weakly supervised embedding training, and the cross-lingual retrieval dataset and dynamic hard negative mining for supervised embedding training, Conan-embedding-v2 achieves SOTA performance while maintaining a relatively small model size and fast inference.
