# BCEmbedding
Netease Youdao's open-source embedding and reranker models for RAG products.
- <a href="#-bilingual-and-crosslingual-superiority" target="_Self">🌐 Bilingual and Crosslingual Superiority</a>
- <a href="#-key-features" target="_Self">💡 Key Features</a>
- <a href="#-latest-updates" target="_Self">🚀 Latest Updates</a>
- <a href="#-model-list" target="_Self">🍎 Model List</a>
- <a href="#-manual" target="_Self">📖 Manual</a>
- <a href="#installation" target="_Self">Installation</a>
- <a href="#quick-start" target="_Self">Quick Start (`transformers`, `sentence-transformers`)</a>
- <a href="#embedding-and-reranker-integrations-for-rag-frameworks" target="_Self">Embedding and Reranker Integrations for RAG Frameworks (`langchain`, `llama_index`)</a>
- <a href="#%EF%B8%8F-evaluation" target="_Self">⚙️ Evaluation</a>
- <a href="#evaluate-semantic-representation-by-mteb" target="_Self">Evaluate Semantic Representation by MTEB</a>
- <a href="#evaluate-rag-by-llamaindex" target="_Self">Evaluate RAG by LlamaIndex</a>
- <a href="#-leaderboard" target="_Self">📈 Leaderboard</a>
- <a href="#semantic-representation-evaluations-in-mteb" target="_Self">Semantic Representation Evaluations in MTEB</a>
- <a href="#rag-evaluations-in-llamaindex" target="_Self">RAG Evaluations in LlamaIndex</a>
- <a href="#-youdaos-bcembedding-api" target="_Self">🛠 Youdao's BCEmbedding API</a>
- <a href="#-wechat-group" target="_Self">🧲 WeChat Group</a>
- <a href="#%EF%B8%8F-citation" target="_Self">✏️ Citation</a>
- <a href="#-license" target="_Self">🔐 License</a>
- <a href="#-related-links" target="_Self">🔗 Related Links</a>
Bilingual and Crosslingual Embedding (BCEmbedding) in English and Chinese, developed by NetEase Youdao, encompasses EmbeddingModel and RerankerModel. The EmbeddingModel specializes in generating semantic vectors, playing a crucial role in semantic search and question-answering, and the RerankerModel excels at refining search results and ranking tasks.
BCEmbedding serves as the cornerstone of Youdao's Retrieval Augmented Generation (RAG) implementation, notably QAnything [github], an open-source implementation widely integrated in various Youdao products like Youdao Speed Reading and Youdao Translation.
Distinguished for its bilingual and crosslingual proficiency, BCEmbedding excels in bridging Chinese and English linguistic gaps, achieving
- A high performance on <a href="#semantic-representation-evaluations-in-mteb" target="_Self">Semantic Representation Evaluations in MTEB</a>;
- A new benchmark in the realm of <a href="#rag-evaluations-in-llamaindex" target="_Self">RAG Evaluations in LlamaIndex</a>.
## Our Goals
Provide a bilingual and crosslingual two-stage retrieval model repository for the RAG community, which can be used directly without finetuning, including `EmbeddingModel` and `RerankerModel`:

- One Model: `EmbeddingModel` handles bilingual and crosslingual retrieval tasks in English and Chinese. `RerankerModel` supports English, Chinese, Japanese and Korean.
- One Model: Cover common business application scenarios with RAG optimization, e.g. Education, Medical Scenario, Law, Finance, Literature, FAQ, Textbook, Wikipedia, General Conversation.
- Easy to Integrate: We provide APIs in `BCEmbedding` for LlamaIndex and LangChain integrations.
- Other Points:
  - `RerankerModel` supports reranking long passages (more than 512 tokens, less than 32k tokens).
  - `RerankerModel` provides a meaningful relevance score that helps remove low-quality passages.
  - `EmbeddingModel` does not need specific instructions.
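The two-stage retrieval these models implement can be sketched in plain Python. This is a minimal illustration, not the library API: the vectors stand in for `EmbeddingModel` outputs, and `rerank_score` stands in for a cross-encoder such as `RerankerModel`.

```python
import numpy as np

def two_stage_retrieve(query_vec, passage_vecs, rerank_score, top_k=2):
    """Stage 1: dual-encoder retrieval by cosine similarity.
    Stage 2: cross-encoder reranking of the top-k candidates."""
    def cos(a, b):
        a, b = np.asarray(a, float), np.asarray(b, float)
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # stage 1: rank all passages by embedding similarity, keep top_k
    candidates = sorted(range(len(passage_vecs)),
                        key=lambda i: cos(query_vec, passage_vecs[i]),
                        reverse=True)[:top_k]
    # stage 2: re-order the survivors with the (more expensive) reranker
    return sorted(candidates, key=rerank_score, reverse=True)

# toy data: passages 1 and 2 are closest to the query by embedding,
# and the stand-in reranker prefers passage 2
q = [0.0, 1.0]
p = [[1.0, 0.0], [0.1, 0.9], [0.3, 0.7]]
order = two_stage_retrieve(q, p, rerank_score=lambda i: {1: 0.9, 2: 0.95}.get(i, 0.0))
```

The first stage is cheap enough to scan a whole corpus; the second stage spends more compute on only the few retrieved candidates.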
Third-party Examples
- RAG applications: QAnything, ragflow, HuixiangDou, ChatPDF.
- Efficient inference: ChatLLM.cpp, Xinference, mindnlp (Huawei GPU).
## 🌐 Bilingual and Crosslingual Superiority
Existing embedding models often encounter performance challenges in bilingual and crosslingual scenarios, particularly in Chinese, English and their crosslingual tasks. BCEmbedding, leveraging the strength of Youdao's translation engine, excels in delivering superior performance across monolingual, bilingual, and crosslingual settings.
`EmbeddingModel` supports Chinese (ch) and English (en) (support for more languages is coming soon), while `RerankerModel` supports Chinese (ch), English (en), Japanese (ja) and Korean (ko).
## 💡 Key Features
- Bilingual and Crosslingual Proficiency: Powered by Youdao's translation engine, excelling in Chinese, English and their crosslingual retrieval tasks, with upcoming support for additional languages.
- RAG-Optimized: Tailored for diverse RAG tasks including translation, summarization, and question answering, ensuring accurate query understanding. See <a href="#rag-evaluations-in-llamaindex" target="_Self">RAG Evaluations in LlamaIndex</a>.
- Efficient and Precise Retrieval: Dual-encoder `EmbeddingModel` for efficient retrieval in the first stage, and cross-encoder `RerankerModel` for enhanced precision and deeper semantic analysis in the second stage.
- Broad Domain Adaptability: Trained on diverse datasets for superior performance across various fields.
- User-Friendly Design: Instruction-free, versatile use for multiple tasks without specifying a query instruction for each task.
- Meaningful Reranking Scores: `RerankerModel` provides relevance scores to improve result quality and optimize large language model performance.
- Proven in Production: Successfully implemented and validated in Youdao's products.
## 🚀 Latest Updates
- 2024-02-04: Technical Blog - See <a href="https://zhuanlan.zhihu.com/p/681370855">为RAG而生-BCEmbedding技术报告</a>.
- 2024-01-16: LangChain and LlamaIndex Integrations - See <a href="#embedding-and-reranker-integrations-for-rag-frameworks" target="_Self">more</a>.
- 2024-01-03: Model Releases - bce-embedding-base_v1 and bce-reranker-base_v1 are available.
- 2024-01-03: Eval Datasets [CrosslingualMultiDomainsDataset] - Evaluate the performance of RAG, using LlamaIndex.
- 2024-01-03: Eval Datasets [Details] - Evaluate the performance of crosslingual semantic representation, using MTEB.
## 🍎 Model List
| Model Name | Model Type | Languages | Parameters | Weights |
| :-------------------- | :----------------: | :------------: | :--------: | :-------------------------------------------------------------------------------------------------------------------------------------------------------: |
| bce-embedding-base_v1 | EmbeddingModel | ch, en | 279M | Huggingface, Domestic Mirror (China) |
| bce-reranker-base_v1 | RerankerModel | ch, en, ja, ko | 279M | Huggingface, Domestic Mirror (China) |
## 📖 Manual
### Installation
First, create a conda environment and activate it.
```shell
conda create --name bce python=3.10 -y
conda activate bce
```
Then install `BCEmbedding` for a minimal installation (to avoid CUDA version conflicts, first manually install a torch build compatible with your system's CUDA version):
```shell
pip install BCEmbedding==0.1.5
```
Or install from source (recommended):
```shell
git clone git@github.com:netease-youdao/BCEmbedding.git
cd BCEmbedding
pip install -v -e .
```
### Quick Start
1. Based on `BCEmbedding`

Use `EmbeddingModel` (the `cls` pooler is the default):
```python
from BCEmbedding import EmbeddingModel

# list of sentences
sentences = ['sentence_0', 'sentence_1']

# init embedding model
model = EmbeddingModel(model_name_or_path="maidalun1020/bce-embedding-base_v1")

# extract embeddings
embeddings = model.encode(sentences)
```
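The vectors returned by `encode` can then be compared with cosine similarity for semantic search. A minimal sketch with NumPy; the hard-coded vectors below are stand-ins for real `model.encode(...)` outputs:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# stand-in embeddings; in practice: model.encode([query] + passages)
query_emb = [0.1, 0.9, 0.0]
passage_embs = [[0.1, 0.8, 0.1], [0.9, 0.1, 0.0]]

scores = [cosine_similarity(query_emb, e) for e in passage_embs]
best = max(range(len(scores)), key=scores.__getitem__)  # index of the best match
```

In a real pipeline, the top-scoring passages from this stage are what you pass to `RerankerModel` for the second, more precise stage.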
Use `RerankerModel` to calculate relevance scores and rerank:

```python
from BCEmbedding import RerankerModel

# your query and corresponding passages
query = 'input_query'
passages = ['passage_0', 'passage_1']

# construct sentence pairs and init reranker model
sentence_pairs = [[query, passage] for passage in passages]
model = RerankerModel(model_name_or_path="maidalun1020/bce-reranker-base_v1")

# calculate scores of sentence pairs, or rerank passages directly
scores = model.compute_score(sentence_pairs)
rerank_results = model.rerank(query, passages)
```