TSDS

Implementation of TSDS: Data Selection for Task-Specific Model Finetuning. An optimal-transport framework for selecting domain-specific and task-specific training data to improve LLM finetuning and instruction tuning.

This repository contains the implementation of the framework described in TSDS: Data Selection for Task-Specific Model Finetuning.

Prerequisites

Before running the project, ensure you have Python installed. You can download the latest version from python.org.

Installation

  1. Clone the repository:

    git clone https://github.com/ZifanL/TSDS.git
    cd TSDS
    
  2. Install the required dependencies from the requirements.txt file:

    pip install -r requirements.txt
    
  3. (Optional) If you're using faiss-gpu, ensure you have the correct GPU drivers installed. Refer to the Faiss documentation for more information.

Usage

After installing the dependencies, run the project on the toy data:

    python tsds.py

The output file selected_candidate_indices.npy, written to the output folder, contains the indices of the selected candidates.
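
Once a run has finished, the saved indices can be used to pull the chosen rows out of the candidate pool. The sketch below is a minimal illustration and not part of the repository. It assumes the file holds a one-dimensional integer array of row positions into the candidate embedding array. It uses stand-in data and a temporary directory in place of the real output folder, and the helper name `load_selected` is hypothetical.

```python
import os
import tempfile

import numpy as np


def load_selected(indices_path, candidates):
    """Return the saved indices and the candidate rows they point to."""
    indices = np.load(indices_path)
    # Assumption: indices are integer row positions into `candidates`.
    if indices.ndim != 1 or not np.issubdtype(indices.dtype, np.integer):
        raise ValueError("expected a 1-D integer index array")
    if indices.size and (indices.min() < 0 or indices.max() >= len(candidates)):
        raise ValueError("index out of range for the candidate pool")
    return indices, candidates[indices]


# Stand-in data: 6 candidates with 4-dimensional embeddings.
candidates = np.arange(24, dtype=np.float32).reshape(6, 4)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "selected_candidate_indices.npy")
    np.save(path, np.array([4, 0, 4]))  # made-up selection for the demo
    indices, selected = load_selected(path, candidates)
```

The same indexing step applies to any array or list aligned with the candidate embeddings, such as the raw text examples they were computed from.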

To run TSDS on your customized data, two embedding files are needed:

  • An .npy file that stores the embeddings of the candidate examples. The shape of the array should be (number of candidates, embedding dimensions).
  • An .npy file that stores the embeddings of the query examples. The shape of the array should be (number of query examples, embedding dimensions).

Change the file paths in config.yaml, and adjust the other parameters there as needed.

The implementation uses faiss.IndexIVFFlat for approximate nearest neighbor search. To use a customized index, add it to faiss_helper.py and replace FaissIndexIVFFlat in tsds.py with it.
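
The two embedding files can be written with `numpy.save`. The following sketch is an illustration under stated assumptions, not the repository's own tooling. The file names, the helper `save_embeddings`, and the random stand-in vectors are hypothetical. The cast to `float32` reflects what Faiss indexes expect, not a documented TSDS requirement. The check that both arrays share one embedding dimension is likewise an assumption, made because nearest neighbor search compares queries against candidates. In practice the arrays would come from whichever encoder you use to embed the examples.

```python
import os
import tempfile

import numpy as np


def save_embeddings(path, embeddings):
    """Validate a 2-D embedding matrix and write it as an .npy file."""
    # Faiss operates on contiguous float32 arrays, so convert up front.
    arr = np.ascontiguousarray(embeddings, dtype=np.float32)
    if arr.ndim != 2:
        raise ValueError("embeddings must have shape (n_examples, dim)")
    np.save(path, arr)
    return arr.shape


# Stand-in vectors; replace with real encoder outputs.
rng = np.random.default_rng(0)
candidate_emb = rng.standard_normal((1000, 64))
query_emb = rng.standard_normal((20, 64))

# Assumption: candidates and queries share one embedding dimension.
if candidate_emb.shape[1] != query_emb.shape[1]:
    raise ValueError("candidate and query embedding dimensions differ")

out_dir = tempfile.mkdtemp()
candidate_path = os.path.join(out_dir, "candidate_embeddings.npy")
query_path = os.path.join(out_dir, "query_embeddings.npy")
cand_shape = save_embeddings(candidate_path, candidate_emb)
query_shape = save_embeddings(query_path, query_emb)
```

The two resulting paths are what the file path entries in config.yaml would then point to.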

Citation

Please cite our paper if you find this repo helpful in your work:

@inproceedings{
	liu2024tsds,
	title={{TSDS}: Data Selection for Task-Specific Model Finetuning},
	author={Zifan Liu and Amin Karbasi and Theodoros Rekatsinas},
	booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
	year={2024},
	url={https://openreview.net/forum?id=wjbTHLUSzU}
}