Vsearch


Vsearch: Representing data on LM <u>V</u>ocabulary space for <u>Search</u>.

This repository includes:

  • VDR: Retrieval-based Disentangled Representation Learning with Natural Language Supervision

    <details> <summary>Click to see details of VDR.</summary> VDR disentangles multi-modal data on the MLM vocabulary space, yielding an interpretable and effective multimodal retrieval model. <div> <a href="https://openreview.net/forum?id=ZlQRiFmq7Y"><img src="https://img.shields.io/badge/Openreview-red.svg" alt="Openreview"></a> <a href="https://jzhoubu.github.io/vdr.github.io/"><img src="https://img.shields.io/badge/Demo-Brightgreen.svg" alt="Demo"></a> </div> </details>
  • SVDR: Semi-Parametric Retrieval via Binary Token Index

    <details> <summary>Click to see details of SVDR. </summary> <div style="font-style: italic;"> SVDR reduces the indexing time and cost to meet various scenarios, making powerful retrieval-augmented applications accessible to everyone. </div> <div align="center"> <img src="docs/images/home/svdr.png" width="100%" height="100%"> </div> </details>

🗺 Overview

  1. Preparation

    • Setup Environment
    • Download Data
  2. Quick Start

    • Embedding and Compute Relevance
    • Building an Index for Large-scale Retrieval
    • Building a Bag-of-Token Index for Faster Retrieval Setup
    • Inspecting Retrieval Insights from Representations
    • Semi-parametric Retrieval
    • Cross-modal Retrieval
  3. Training

  4. Inference

    • Build index
    • Search
    • Scoring

💻 Preparation

Setup Environment via Poetry

# install poetry first
# curl -sSL https://install.python-poetry.org | python3 -
poetry install
poetry shell

Download Data

Download data using identifiers in the YAML configuration files at conf/data_stores/*.yaml.

# Download a single dataset file
python download.py nq_train
# Download multiple dataset files:
python download.py nq_train trivia_train
# Download all dataset files:
python download.py all

🚀 Quick Start

Embedding and Compute Relevance

import torch
from src.ir import Retriever

# Define a query and a list of passages
query = "Who first proposed the theory of relativity?"
passages = [
    "Albert Einstein (14 March 1879 – 18 April 1955) was a German-born theoretical physicist who is widely held to be one of the greatest and most influential scientists of all time. He is best known for developing the theory of relativity.",
    "Sir Isaac Newton FRS (25 December 1642 – 20 March 1727) was an English polymath active as a mathematician, physicist, astronomer, alchemist, theologian, and author who was described in his time as a natural philosopher.",
    "Nikola Tesla (10 July 1856 – 7 January 1943) was a Serbian-American inventor, electrical engineer, mechanical engineer, and futurist. He is known for his contributions to the design of the modern alternating current (AC) electricity supply system."
]

# Initialize the retriever
ir = Retriever.from_pretrained("vsearch/svdr-msmarco")
ir = ir.to("cuda")

# Embed the query and passages
q_emb = ir.encoder_q.embed(query)  # Shape: [1, V]
p_emb = ir.encoder_p.embed(passages)  # Shape: [3, V]

# Query-passage Relevance
scores = q_emb @ p_emb.t()
print(scores)

# Output: 
# tensor([[97.2964, 39.7844, 37.6955]], device='cuda:0')
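Conceptually, a vocabulary-space embedding can be read as a sparse `{token: weight}` map, and `q_emb @ p_emb.t()` as a dot product over the tokens both texts activate. The following is a toy sketch of that idea only, not the actual VDR/SVDR encoder, and the token weights are hypothetical:

```python
# Toy illustration of vocabulary-space relevance (NOT the real encoder):
# each text is a sparse {token: weight} vector over the LM vocabulary,
# and relevance is the dot product over shared tokens.

def sparse_dot(q_vec, p_vec):
    """Dot product of two sparse vocabulary-space vectors."""
    return sum(w * p_vec[t] for t, w in q_vec.items() if t in p_vec)

# Hypothetical token weights for illustration only
q_vec = {"relativity": 7.26, "proposed": 3.59, "first": 2.92}
p_vec = {"relativity": 7.53, "einstein": 6.10, "physicist": 2.40}

# Only "relativity" appears in both vectors, so it alone contributes
print(sparse_dot(q_vec, p_vec))
```

Because the score accumulates only over shared tokens, unrelated passages (few shared informative tokens) score near zero, which mirrors the gap between the first and the other two passages above.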

Building an Index for Large-scale Retrieval

For large-scale retrieval tasks, it's efficient to build the index once and reuse it for subsequent retrieval tasks.

# Build the sparse index for the passages
ir.build_index(passages, index_type="sparse")
print(ir.index)

# Output:
# Index Type      : SparseIndex
# Vector Shape    : torch.Size([3, 29523])
# Vector Dtype    : torch.float32
# Vector Layout   : torch.sparse_csr
# Number of Texts : 3
# Vector Device   : cuda:0

# Save the index to disk
index_file = "/path/to/index.npz"
ir.save_index(index_file)

# Load the index from disk
index_file = "/path/to/index.npz"
data_file = "/path/to/texts.jsonl"
ir.load_index(index_file=index_file, data_file=data_file)

You can retrieve results for queries directly from a pre-built index.

# Search top-k results for queries
queries = [query]
results = ir.retrieve(queries, k=3)
print(results)

# Output:
# SearchResults(
#   ids=tensor([[0, 1, 2]], device='cuda:0'),
#   scores=tensor([[97.2458, 39.7507, 37.6407]], device='cuda:0')
# )

query_id = 0
top1_psg_id = results.ids[query_id][0]
top1_psg = ir.index.get_sample(top1_psg_id)
print(top1_psg)
# Output:

# Albert Einstein (14 March 1879 – 18 April 1955) was a German-born theoretical physicist who is widely held to be one of the greatest and most influential scientists of all time. He is best known for developing the theory of relativity.

Building a Bag-of-Token Index for Faster Retrieval Setup

Our framework supports a non-parametric index built directly from the tokenizer, known as the Bag-of-Token (BoT) index. This reduces indexing time and on-disk index size by over 90%. You can build and use a BoT index as follows:

# Build the bag-of-token index for the passages
ir.build_index(passages, index_type="bag_of_token")
print(ir.index)

# Output:
# Index Type      : BoTIndex
# Vector Shape    : torch.Size([3, 29523])
# Vector Dtype    : torch.float16
# Vector Layout   : torch.sparse_csr
# Number of Texts : 3
# Vector Device   : cuda:0

# Search top-k results from bag-of-token index, and embed and rerank them on-the-fly
queries = [query]
results = ir.retrieve(queries, k=3, rerank=True)
print(results)

# Output:
# SearchResults(
#   ids=tensor([0, 2, 1], device='cuda:0'),
#   scores=tensor([97.2964, 39.7844, 37.6955], device='cuda:0')
# )
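The BoT index needs no model forward pass at indexing time: each passage is represented just by the set of its token ids. A minimal sketch of that idea, using a stand-in whitespace tokenizer instead of the model's real tokenizer and plain Python sets instead of a sparse CSR matrix:

```python
# Toy sketch of a bag-of-token (BoT) index: passages are indexed by
# their token-id sets alone, so no encoder is run at indexing time.
# (Illustrative only; the real index uses the model tokenizer and a
# sparse CSR matrix over its ~30k-token vocabulary.)

def tokenize(text):
    # stand-in for the model tokenizer
    return text.lower().replace(".", "").replace(",", "").split()

def build_bot_index(passages):
    vocab, rows = {}, []
    for text in passages:
        ids = {vocab.setdefault(tok, len(vocab)) for tok in tokenize(text)}
        rows.append(ids)
    return vocab, rows

def bot_score(query, vocab, rows):
    # score = number of query tokens each passage contains
    q_ids = {vocab[t] for t in tokenize(query) if t in vocab}
    return [len(q_ids & row) for row in rows]

passages = ["Einstein developed the theory of relativity.",
            "Newton was an English mathematician."]
vocab, rows = build_bot_index(passages)
print(bot_score("who proposed the theory of relativity", vocab, rows))  # [4, 0]
```

In the library, this coarse token-overlap scoring is what `rerank=True` refines: the top candidates from the cheap BoT index are re-embedded and re-scored with the model on the fly.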

Inspecting Retrieval Insights from Representations

# Inspect token-level importance/weights of the query embeddings
token_weights = ir.encoder_q.dst(query, topk=768, visual=False) 
print(token_weights)

# Output: 
# {
#     'relativity': 7.262620449066162, 
#     'proposed': 3.588329792022705, 
#     'first': 2.918099880218506, 
#     ...
# }

# Inspect token-level contributions to the relevance score (i.e., retrieval results)
token_contributions = ir.explain(q=query, p=passages[0], topk=768, visual=False)
print(token_contributions)

# Output: 
# {
#     'relativity': 54.66442432370013, 
#     'whom': 13.934619790257784, 
#     'theory': 4.645142051911478, 
#     ...
# }
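Token-level explanations like the output above are possible because a sparse dot product decomposes additively: each shared token contributes `q_weight * p_weight` to the final score. A toy sketch of that decomposition (not the library's `ir.explain`, and with hypothetical weights):

```python
# Toy sketch of per-token score attribution: since the relevance score
# is a dot product over the vocabulary, each shared token's contribution
# is simply q_weight[t] * p_weight[t], and the contributions sum to the score.

def explain_toy(q_vec, p_vec):
    contrib = {t: w * p_vec[t] for t, w in q_vec.items() if t in p_vec}
    return dict(sorted(contrib.items(), key=lambda kv: -kv[1]))

# Hypothetical weights for illustration only
q_vec = {"relativity": 7.26, "theory": 1.10, "first": 2.92}
p_vec = {"relativity": 7.53, "theory": 4.22, "einstein": 6.10}

print(explain_toy(q_vec, p_vec))
```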
<details> <summary>Semi-parametric Retrieval</summary>

Alpha search

# non-parametric query -> parametric passage
# (here `svdr` is an SVDR retriever, e.g. Retriever.from_pretrained("vsearch/svdr-msmarco"))
q_bin = svdr.encoder_q.embed(query, bow=True)
p_emb = svdr.encoder_p.embed(passages)
scores = q_bin @ p_emb.t()

Beta search

# parametric query -> non-parametric passage (binary token index)
q_emb = svdr.encoder_q.embed(query)
p_bin = svdr.encoder_p.embed(passages, bow=True)
scores = q_emb @ p_bin.t()
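The two modes differ only in which side of the dot product is the cheap binary bag-of-words vector and which is the learned dense vector. A self-contained toy sketch with hypothetical weights (not the library API):

```python
# Toy sketch of semi-parametric scoring:
# alpha: binary (bag-of-words) query  x  parametric passage embedding
# beta:  parametric query embedding   x  binary passage (binary token index)

q_dense = {"relativity": 7.26, "proposed": 3.59}   # hypothetical learned query weights
p_dense = {"relativity": 7.53, "einstein": 6.10}   # hypothetical learned passage weights
q_bin = {t: 1.0 for t in q_dense}                  # analogue of embed(..., bow=True)
p_bin = {t: 1.0 for t in p_dense}

alpha = sum(w * p_dense.get(t, 0.0) for t, w in q_bin.items())
beta = sum(w * p_bin.get(t, 0.0) for t, w in q_dense.items())
print(alpha, beta)  # 7.53 7.26 (only "relativity" is shared)
```

Beta search is what makes the binary token index attractive: passages need no model forward pass at indexing time, while the query side still benefits from learned weights.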
</details> <details> <summary>Cross-modal Retrieval</summary>
# Note: we use `encoder_q` for text and `encoder_p` for image
vdr_cross_modal = Retriever.from_pretrained("vsearch/vdr-cross-modal") 

image_file = './examples/images/mars.png'
texts = [
  "Four thousand Martian days after setting its wheels in Gale Crater on Aug. 5, 2012, NASA’s Curiosity rover remains busy conducting exciting science. The rover recently drilled its 39th sample then dropped the pulverized rock into its belly for detailed analysis.",
  "ChatGPT is a chatbot developed by OpenAI and launched on November 30, 2022. Based on a large language model, it enables users to refine and steer a conversation towards a desired length, format, style, level of detail, and language."
]
image_emb = vdr_cross_modal.encoder_p.embed(image_file) # Shape: [1, V]
text_emb = vdr_cross_modal.encoder_q.embed(texts)  # Shape: [2, V]

# Image-text Relevance
scores = image_emb @ text_emb.t()
print(scores)

# Output: 
# tensor([[0.3209, 0.0984]])
</details>

👾 Training

Training is tested with python==3.9 and torch==2.2.1; configuration is handled through hydra==1.3.2.

EXPERIMENT_NAME=test
python -m torch.distributed.launch --nnodes=1 --nproc_per_node=4 train_vdr.py \
hydra.run.dir=./experiments/${EXPERIMENT_NAME}/train \
train=vdr_nq \
data_stores=wiki21m \
train_datasets=[nq_train]
  • hydra.run.dir: Directory where training logs and outputs will be saved
  • train: Identifier for the training config, in `conf/tr