
QAnything

Question and Answer based on Anything.

<div align="center"> <a href="https://github.com/netease-youdao/QAnything"> <img src="docs/images/qanything_logo.png" alt="Logo" width="800"> </a>

Question and Answer based on Anything

<p align="center"> <a href="./README.md">English</a> | <a href="./README_zh.md">简体中文</a> </p> </div> <div align="center">

<a href="https://qanything.ai"><img src="https://img.shields.io/badge/try%20online-qanything.ai-purple"></a>      <a href="https://read.youdao.com#/home"><img src="https://img.shields.io/badge/try%20online-read.youdao.com-purple"></a>     

<a href="./LICENSE"><img src="https://img.shields.io/badge/license-AGPL--3.0-yellow"></a>      <a href="https://github.com/netease-youdao/QAnything/pulls"><img src="https://img.shields.io/badge/PRs-welcome-red"></a>      <a href="https://twitter.com/YDopensource"><img src="https://img.shields.io/badge/follow-%40YDOpenSource-1DA1F2?logo=twitter&style={style}"></a>     

<a href="https://discord.gg/5uNpPsEJz8"><img src="https://img.shields.io/discord/1197874288963895436?style=social&logo=discord"></a>     

</div>

🚀 Important Updates

<h1><span style="color:red;">Important things should be said three times.</span></h1>

[2024-08-23: QAnything updated to version 2.0.]

[2024-08-23: QAnything updated to version 2.0.]

[2024-08-23: QAnything updated to version 2.0.]

<h2>
  • <span style="color:green">This update brings improvements in various aspects such as usability, resource consumption, search results, question and answer results, parsing results, front-end effects, service architecture, and usage methods.</span>
  • <span style="color:green">At the same time, the old Docker version and Python version have been merged into a new unified version, using a single-line command with Docker Compose for one-click startup, ready to use out of the box.</span>
</h2>
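As of 2.0, the merged version starts with a single Docker Compose command. A hedged sketch of a typical startup (the exact compose file name and service layout are defined by the repository itself; check the repo for your platform):

```shell
# Fetch the repository and bring the stack up in the background.
# The compose configuration shipped with the repo defines the services
# (parsing, OCR, embed, rerank, web front end).
git clone https://github.com/netease-youdao/QAnything.git
cd QAnything
docker compose up -d   # one-line startup, per the 2.0 release notes
```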

Contributing

We appreciate your interest in contributing to our project. Whether you're fixing a bug, improving an existing feature, or adding something completely new, your contributions are welcome!

Thanks to all contributors for their efforts

<a href="https://github.com/netease-youdao/QAnything/graphs/contributors"> <img src="https://contrib.rocks/image?repo=netease-youdao/QAnything" /> </a>

Special thanks!

<h2><span style="color:red;">Please note: Our list of contributors is automatically updated, so your contributions may not appear immediately on this list.</span></h2> <h2><span style="color:red;">Special thanks!:@ikun-moxiaofei</span></h2> <h2><span style="color:red;">Special thanks!:@Ianarua</span></h2>

Business contact information:

010-82558901

What is QAnything?

QAnything (Question and Answer based on Anything) is a local knowledge base question-answering system designed to support a wide range of file formats and databases, allowing for offline installation and use.

With QAnything, you can simply drop any locally stored file of any format and receive accurate, fast, and reliable answers.

Currently supported formats include: PDF (pdf), Word (docx), PPT (pptx), XLS (xlsx), Markdown (md), Email (eml), TXT (txt), Image (jpg, jpeg, png), CSV (csv), Web links (html), and more formats coming soon…

Key features

  • Data security: supports fully offline installation and use — the network cable can stay unplugged for the entire process.
  • Supports multiple file types with a high parsing success rate; supports cross-language question answering, switching freely between Chinese and English regardless of the file's language.
  • Supports question answering over massive data using two-stage retrieval (embedding recall followed by reranking), which solves the retrieval degradation problem at scale: the more data, the better the results, with no limit on the number of uploaded files and fast retrieval.
  • Hardware friendly: runs in a pure CPU environment by default, supports Windows, Mac, and Linux, and has no dependencies other than Docker.
  • User-friendly: no cumbersome configuration, one-click installation and deployment, ready to use out of the box; every dependent component (PDF parsing, OCR, embed, rerank, etc.) is fully independent and freely replaceable.
  • Supports a Kimi-style quick start, a fileless chat mode, a retrieval-only mode, and a custom Bot mode.

Architecture

<div align="center"> <img src="docs/images/qanything_arch.png" width = "700" alt="qanything_system" align=center /> </div>

Why 2 stage retrieval?

In scenarios with a large volume of knowledge base data, the advantages of a two-stage approach are very clear. If only first-stage embedding retrieval is used, accuracy degrades as the data volume grows, as the green line in the graph below shows. After second-stage reranking, however, accuracy rises steadily: the more data, the better the performance.

<div align="center"> <img src="docs/images/two_stage_retrieval.jpg" width = "500" alt="two stage retrieval" align=center /> </div>
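The recall-then-rerank control flow described above can be sketched in a few lines. This is a toy illustration only: `embed_score` and `rerank_score` below are hypothetical stand-ins for bce-embedding-base_v1 and bce-reranker-base_v1, showing the two-stage structure rather than the real models.

```python
def embed_score(query, doc):
    """Stage 1 stand-in: cheap set-overlap score over the whole corpus."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q | d), 1)

def rerank_score(query, doc):
    """Stage 2 stand-in: a costlier per-pair score, applied to candidates only."""
    q = query.lower().split()
    d = doc.lower().split()
    return sum(d.count(t) for t in q) / max(len(d), 1)

def two_stage_retrieve(query, corpus, recall_k=3, top_k=1):
    # Stage 1: recall a small candidate set from the full corpus.
    candidates = sorted(corpus, key=lambda doc: embed_score(query, doc),
                        reverse=True)[:recall_k]
    # Stage 2: rerank only the candidates with the expensive scorer.
    return sorted(candidates, key=lambda doc: rerank_score(query, doc),
                  reverse=True)[:top_k]

corpus = [
    "QAnything answers questions over local files",
    "two stage retrieval uses embedding recall then reranking",
    "the weather is nice today",
    "reranking stabilizes accuracy as data grows",
]
print(two_stage_retrieve("how does two stage retrieval rerank", corpus))
```

The point of the split is cost: the stage-1 scorer touches every document, so it must be cheap, while the stage-2 scorer sees only `recall_k` candidates and can afford to be much more accurate.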

QAnything uses the retrieval component BCEmbedding, which is distinguished by its bilingual and cross-lingual proficiency. BCEmbedding excels at bridging Chinese and English linguistic gaps, achieving

  • A high performance on <a href="https://github.com/netease-youdao/BCEmbedding/tree/master?tab=readme-ov-file#evaluate-semantic-representation-by-mteb" target="_Self">Semantic Representation Evaluations in MTEB</a>;
  • A new benchmark in the realm of <a href="https://github.com/netease-youdao/BCEmbedding/tree/master?tab=readme-ov-file#evaluate-rag-by-llamaindex" target="_Self">RAG Evaluations in LlamaIndex</a>.

1st Retrieval(embedding)

| Model | Retrieval | STS | PairClassification | Classification | Reranking | Clustering | Avg |
|:-------------------------------|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|
| bge-base-en-v1.5 | 37.14 | 55.06 | 75.45 | 59.73 | 43.05 | 37.74 | 47.20 |
| bge-base-zh-v1.5 | 47.60 | 63.72 | 77.40 | 63.38 | 54.85 | 32.56 | 53.60 |
| bge-large-en-v1.5 | 37.15 | 54.09 | 75.00 | 59.24 | 42.68 | 37.32 | 46.82 |
| bge-large-zh-v1.5 | 47.54 | 64.73 | 79.14 | 64.19 | 55.88 | 33.26 | 54.21 |
| jina-embeddings-v2-base-en | 31.58 | 54.28 | 74.84 | 58.42 | 41.16 | 34.67 | 44.29 |
| m3e-base | 46.29 | 63.93 | 71.84 | 64.08 | 52.38 | 37.84 | 53.54 |
| m3e-large | 34.85 | 59.74 | 67.69 | 60.07 | 48.99 | 31.62 | 46.78 |
| bce-embedding-base_v1 | 57.60 | 65.73 | 74.96 | 69.00 | 57.29 | 38.95 | 59.43 |

2nd Retrieval(rerank)

| Model | Reranking | Avg |
|:-------------------------------|:--------:|:--------:|
| bge-reranker-base | 57.78 | 57.78 |
| bge-reranker-large | 59.69 | 59.69 |
| bce-reranker-base_v1 | 60.06 | 60.06 |

RAG Evaluations in LlamaIndex(embedding and rerank)

<img src="https://github.com/netease-youdao/BCEmbedding/blob/master/Docs/assets/rag_eval_multiple_domains_summary.jpg">

NOTE:

  • In the WithoutReranker setting, our bce-embedding-base_v1 outperforms all other embedding models.
  • With the embedding model fixed, our bce-reranker-base_v1 achieves the best performance.
  • The combination of bce-embedding-base_v1 and bce-reranker-base_v1 is SOTA.
  • If you want to use the embedding and rerank models separately, please refer to BCEmbedding.

LLM

The open-source version of QAnything is based on QwenLM and has been fine-tuned on a large number of professional question-answering datasets, which greatly enhances its question-answering ability. If you need to use it for commercial purposes, please follow the QwenLM license. For more details, please refer to: QwenLM
