# BCEmbedding
Netease Youdao's open-source embedding and reranker models for RAG products.
- <a href="#-bilingual-and-crosslingual-superiority" target="_Self">🌐 Bilingual and Crosslingual Superiority</a>
- <a href="#-key-features" target="_Self">💡 Key Features</a>
- <a href="#-latest-updates" target="_Self">🚀 Latest Updates</a>
- <a href="#-model-list" target="_Self">🍎 Model List</a>
- <a href="#-manual" target="_Self">📖 Manual</a>
- <a href="#installation" target="_Self">Installation</a>
- <a href="#quick-start" target="_Self">Quick Start (`transformers`, `sentence-transformers`)</a>
- <a href="#embedding-and-reranker-integrations-for-rag-frameworks" target="_Self">Embedding and Reranker Integrations for RAG Frameworks (`langchain`, `llama_index`)</a>
- <a href="#%EF%B8%8F-evaluation" target="_Self">⚙️ Evaluation</a>
- <a href="#evaluate-semantic-representation-by-mteb" target="_Self">Evaluate Semantic Representation by MTEB</a>
- <a href="#evaluate-rag-by-llamaindex" target="_Self">Evaluate RAG by LlamaIndex</a>
- <a href="#-leaderboard" target="_Self">📈 Leaderboard</a>
- <a href="#semantic-representation-evaluations-in-mteb" target="_Self">Semantic Representation Evaluations in MTEB</a>
- <a href="#rag-evaluations-in-llamaindex" target="_Self">RAG Evaluations in LlamaIndex</a>
- <a href="#-youdaos-bcembedding-api" target="_Self">🛠 Youdao's BCEmbedding API</a>
- <a href="#-wechat-group" target="_Self">🧲 WeChat Group</a>
- <a href="#%EF%B8%8F-citation" target="_Self">✏️ Citation</a>
- <a href="#-license" target="_Self">🔐 License</a>
- <a href="#-related-links" target="_Self">🔗 Related Links</a>
Bilingual and Crosslingual Embedding (BCEmbedding) in English and Chinese, developed by NetEase Youdao, encompasses EmbeddingModel and RerankerModel. The EmbeddingModel specializes in generating semantic vectors, playing a crucial role in semantic search and question-answering, and the RerankerModel excels at refining search results and ranking tasks.
BCEmbedding serves as the cornerstone of Youdao's Retrieval Augmented Generation (RAG) implementation, notably QAnything [github], an open-source implementation widely integrated in various Youdao products like Youdao Speed Reading and Youdao Translation.
Distinguished for its bilingual and crosslingual proficiency, BCEmbedding excels in bridging Chinese and English linguistic gaps, achieving
- A high performance on <a href="#semantic-representation-evaluations-in-mteb" target="_Self">Semantic Representation Evaluations in MTEB</a>;
- A new benchmark in the realm of <a href="#rag-evaluations-in-llamaindex" target="_Self">RAG Evaluations in LlamaIndex</a>.
## Our Goals
Provide a bilingual and crosslingual two-stage retrieval model repository for the RAG community, which can be used directly without finetuning, including `EmbeddingModel` and `RerankerModel`:

- One Model: `EmbeddingModel` handles bilingual and crosslingual retrieval tasks in English and Chinese. `RerankerModel` supports English, Chinese, Japanese and Korean.
- One Model: Cover common business application scenarios with RAG optimization, e.g. Education, Medical Scenario, Law, Finance, Literature, FAQ, Textbook, Wikipedia, General Conversation.
- Easy to Integrate: We provide APIs in `BCEmbedding` for LlamaIndex and LangChain integrations.
- Other Points:
  - `RerankerModel` supports reranking long passages (more than 512 tokens, less than 32k tokens).
  - `RerankerModel` provides a meaningful relevance score that helps remove low-quality passages.
  - `EmbeddingModel` does not need specific instructions.
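The two-stage retrieval these models implement can be sketched in plain Python. This is a minimal illustration, not the library API: the vectors stand in for `EmbeddingModel` outputs, and `rerank_score` stands in for a cross-encoder such as `RerankerModel`.

```python
import numpy as np

def two_stage_retrieve(query_vec, passage_vecs, rerank_score, top_k=2):
    """Stage 1: dual-encoder retrieval by cosine similarity.
    Stage 2: cross-encoder reranking of the top-k candidates."""
    def cos(a, b):
        a, b = np.asarray(a, float), np.asarray(b, float)
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # stage 1: rank all passages by embedding similarity, keep top_k
    candidates = sorted(range(len(passage_vecs)),
                        key=lambda i: cos(query_vec, passage_vecs[i]),
                        reverse=True)[:top_k]
    # stage 2: re-order the survivors with the (more expensive) reranker
    return sorted(candidates, key=rerank_score, reverse=True)

# toy data: passages 1 and 2 are closest to the query by embedding,
# and the stand-in reranker prefers passage 2
q = [0.0, 1.0]
p = [[1.0, 0.0], [0.1, 0.9], [0.3, 0.7]]
order = two_stage_retrieve(q, p, rerank_score=lambda i: {1: 0.9, 2: 0.95}.get(i, 0.0))
```

The first stage is cheap enough to scan a whole corpus; the second stage spends more compute on only the few retrieved candidates.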
Third-party Examples
- RAG applications: QAnything, ragflow, HuixiangDou, ChatPDF.
- Efficient inference: ChatLLM.cpp, Xinference, mindnlp (Huawei GPU).
## 🌐 Bilingual and Crosslingual Superiority
Existing embedding models often encounter performance challenges in bilingual and crosslingual scenarios, particularly in Chinese, English and their crosslingual tasks. BCEmbedding, leveraging the strength of Youdao's translation engine, excels in delivering superior performance across monolingual, bilingual, and crosslingual settings.
`EmbeddingModel` supports Chinese (ch) and English (en) (support for more languages is coming soon), while `RerankerModel` supports Chinese (ch), English (en), Japanese (ja) and Korean (ko).
## 💡 Key Features
- Bilingual and Crosslingual Proficiency: Powered by Youdao's translation engine, excelling in Chinese, English and their crosslingual retrieval tasks, with upcoming support for additional languages.
- RAG-Optimized: Tailored for diverse RAG tasks including translation, summarization, and question answering, ensuring accurate query understanding. See <a href="#rag-evaluations-in-llamaindex" target="_Self">RAG Evaluations in LlamaIndex</a>.
- Efficient and Precise Retrieval: Dual-encoder `EmbeddingModel` for efficient retrieval in the first stage, and cross-encoder `RerankerModel` for enhanced precision and deeper semantic analysis in the second stage.
- Broad Domain Adaptability: Trained on diverse datasets for superior performance across various fields.
- User-Friendly Design: Instruction-free, versatile use for multiple tasks without specifying a query instruction for each task.
- Meaningful Reranking Scores: `RerankerModel` provides relevance scores to improve result quality and optimize large language model performance.
- Proven in Production: Successfully implemented and validated in Youdao's products.
## 🚀 Latest Updates
- 2024-02-04: Technical Blog - See <a href="https://zhuanlan.zhihu.com/p/681370855">为RAG而生-BCEmbedding技术报告</a>.
- 2024-01-16: LangChain and LlamaIndex Integrations - See <a href="#embedding-and-reranker-integrations-for-rag-frameworks" target="_Self">more</a>.
- 2024-01-03: Model Releases - bce-embedding-base_v1 and bce-reranker-base_v1 are available.
- 2024-01-03: Eval Datasets [CrosslingualMultiDomainsDataset] - Evaluate the performance of RAG, using LlamaIndex.
- 2024-01-03: Eval Datasets [Details] - Evaluate the performance of crosslingual semantic representation, using MTEB.
## 🍎 Model List
| Model Name | Model Type | Languages | Parameters | Weights |
| :-------------------- | :----------------: | :------------: | :--------: | :-------------------------------------------------------------------------------------------------------------------------------------------------------: |
| bce-embedding-base_v1 | EmbeddingModel | ch, en | 279M | Huggingface, Domestic Mirror (China) |
| bce-reranker-base_v1 | RerankerModel | ch, en, ja, ko | 279M | Huggingface, Domestic Mirror (China) |
## 📖 Manual
### Installation
First, create a conda environment and activate it.
```shell
conda create --name bce python=3.10 -y
conda activate bce
```
Then install `BCEmbedding` for a minimal installation (to avoid CUDA version conflicts, first manually install a torch build compatible with your system's CUDA version):
```shell
pip install BCEmbedding==0.1.5
```
Or install from source (recommended):
```shell
git clone git@github.com:netease-youdao/BCEmbedding.git
cd BCEmbedding
pip install -v -e .
```
### Quick Start
1. Based on `BCEmbedding`

Use `EmbeddingModel` (the `cls` pooler is the default):
```python
from BCEmbedding import EmbeddingModel

# list of sentences
sentences = ['sentence_0', 'sentence_1']

# init embedding model
model = EmbeddingModel(model_name_or_path="maidalun1020/bce-embedding-base_v1")

# extract embeddings
embeddings = model.encode(sentences)
```
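The vectors returned by `encode` can then be compared with cosine similarity for semantic search. A minimal sketch with NumPy; the hard-coded vectors below are stand-ins for real `model.encode(...)` outputs:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# stand-in embeddings; in practice: model.encode([query] + passages)
query_emb = [0.1, 0.9, 0.0]
passage_embs = [[0.1, 0.8, 0.1], [0.9, 0.1, 0.0]]

scores = [cosine_similarity(query_emb, e) for e in passage_embs]
best = max(range(len(scores)), key=scores.__getitem__)  # index of the best match
```

In a real pipeline, the top-scoring passages from this stage are what you pass to `RerankerModel` for the second, more precise stage.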
Use `RerankerModel` to calculate relevance scores and rerank:

```python
from BCEmbedding import RerankerModel

# your query and corresponding passages
query = 'input_query'
passages = ['passage_0', 'passage_1']

# construct sentence pairs and init reranker model
sentence_pairs = [[query, passage] for passage in passages]
model = RerankerModel(model_name_or_path="maidalun1020/bce-reranker-base_v1")

# calculate scores of sentence pairs, or rerank passages directly
scores = model.compute_score(sentence_pairs)
rerank_results = model.rerank(query, passages)
```