SemDeDup
Code for "SemDeDup", a simple method for identifying and removing semantic duplicates from a dataset (data pairs which are semantically similar, but not exactly identical).
Install / Use
/learn @facebookresearch/SemDeDupREADME
SemDeDup: Data-Efficient Learning at Web-scale through semantic deduplication
Code for "SemDeDup", a simple method for identifying and removing “semantic duplicates”: data pairs which are semantically similar, but not exactly identical. Removing semantic duplicates preserves performance and speeds up learning - for more information please see our paper <a href="https://arxiv.org/abs/2303.09540"> here </a>.
<p align="center"> <img src="semdedup.png" width="700" title="semdedup"> </p>Setup
We recommend using conda to install semdedup. For easy install, use semdedup_conda_env.yml to create the conda environment:
conda env create -n semdedup --file semdedup_conda_env.yml
conda activate semdedup
and you should be good to go!
<p align="center"> <img src="intro_fig.png" width="700" title="intro fig"> </p>Running the pipeline
1) Run get_embeddings(.) to compute and store embeddings from a pretrained model.
Template code:
from compute_pretrained_embeddings import get_embeddings
model = ...
dataloader = ...
path_str_type = ...
emb_memory_loc = ...
paths_memory_loc = ...
dataset_size = ...
emb_size = ...
emb_array = np.memmap(emb_memory_loc, dtype='float32', mode='w+', shape=(dataset_size, emb_size))
path_array = np.memmap(emb_memory_loc, dtype=path_str_type, mode='w+', shape=(dataset_size,))
get_embeddings(model, dataloader, emd_memmap, paths_memmap)
Argument descriptions:
dataloader: should return the following
1) data_batch: batch of data examples
2) paths_batch: path to location where the example is stored (unique identifier). For example, this could be "n04235860_14959.JPEG" for imagenet.
3) batch_indices: global index for each example (between 0 and of size <dataset_size>-1).
emd_memmap: numpy memmap to store embeddings of size <dataset_size>.
paths_memmap: numpy memmap to store paths of size <dataset_size>.
2) Run K-Means clustering on embeddings from step (1)
from clustering run the following script
and don't forget to modify the config file /configs/openclip/clustering_configs.yaml first!
python compute_centroids.py --confg-file "/configs/openclip/clustering_configs.yaml" \
--partition <partition> \
--ngpus 1 \
--cpus-per-task 10 \
--timeout 300 \
The script will submit a job using submitit. Alternatively, you can use the following python code:<br>
Template code:
import yaml
import random
import numpy as np
import logging
from clustering.clustering import compute_centroids
logger = logging.getLogger(__name__)
logger.addHandler(logging.StreamHandler())
confg_file = "configs/openclip/clustering_configs.yaml"
## -- Load kmeans clustering parameters from configs file
with open(confg_file, 'r') as y_file:
params = yaml.load(y_file, Loader=yaml.FullLoader)
## -- Fix the seed
SEED = params['seed']
random.seed(SEED)
emb_memory_loc = params['emb_memory_loc']
paths_memory_loc = params['paths_memory_loc']
dataset_size = params['dataset_size']
emb_size = params['emb_size']
path_str_type = params['path_str_type']
emb_memory = np.memmap(emb_memory_loc, dtype='float32', mode='r', shape=(dataset_size, emb_size))
paths_memory = np.memmap(paths_memory_loc, dtype=path_str_type, mode='r', shape=(dataset_size,))
compute_centroids(
data=emb_memory,
ncentroids=params['ncentroids'],
niter=params['niter'],
seed=params['seed'],
Kmeans_with_cos_dist=params['Kmeans_with_cos_dist'],
save_folder=params['save_folder'],
logger=logger,
verbose=True,
)
Argument descriptions:
seed: random seed to run pipeline with (this does not affect results since semdedup is deterministic)
emb_memory_loc: path to numpy memmap file that stores embeddings of size <dataset_size>
paths_memory_loc: path to numpy memmap file that store paths of size <dataset_size>
dataset_size: total size of the dataset that semdedup is running on
emb_size: size of the embeddings computed in step (1)
3) Sort Clusters
from clustering run the following script
and don't forget to modify the config file /configs/openclip/clustering_configs.yaml first!
python sort_clusters.py --confg-file "/configs/openclip/clustering_configs.yaml" \
--partition <partition> \
--num-tasks 10 \
--cpus-per-task 5 \
--timeout 300 \
The script will submit a job using submitit. Alternatively, you can use the following python code:<br>
Template code:
from sort_clusters import assign_and_sort_clusters
assign_and_sort_clusters(
data=emb_memory,
paths_list=paths_memory,
sim_metric=params["sim_metric"],
keep_hard=params["keep_hard"],
kmeans_with_cos_dist=params["kmeans_with_cos_dist"],
save_folder=params["save_folder"],
sorted_clusters_file_loc=params["sorted_clusters_file_loc"],
cluster_ids=range(0, params["ncentroids"]),
logger=logger,
)
4) Run SemDeDup
Run the following script and don't forget to modify the config file semdedup_configs.yaml first!
The script uses submitit to launch jobs on N nodes. Each nodes will process parts of the clusters. We then initiate <tasks-per-node> tasks on every node to increase parallelization and improve speed.
python submit_semdedup_job.py --config-file "semdedup_configs.yaml" \
--eps-list <eps-list> \
--partition <partition> \
--nodes 8 \
--tasks-per-node 20 \
--cpus-per-task 4 \
Argument descriptions:
--config-file: configuration file.
--eps-list: list of epsilons to run semdedup with (lower epsilon will cause more agressive pruning, see https://arxiv.org/abs/2303.09540 for more details)
--tasks-per-node: we initiate <tasks-per-node> tasks on every node to increase parallelization and improve speed.
--cpus-per-task: number of cpus for each task.
5) Extract deduplicated data as paths to examples locations
Template code:
from extract_dedup_data import extract_pruned_data
output_txt_path = ...
semdedup_pruning_tables_path = ...
sorted_clusters_path = ...
eps = ...
num_clusters = ...
extract_pruned_data(sorted_clusters_path, semdedup_pruning_tables_path, eps, num_clusters, output_txt_path, retreive_kept_samples=True)
Argument descriptions:
sorted_clusters_path: path to the sorted clusters files (this is the output of step (3) above).
semdedup_dataframes_path: path to saved semdedup (this is the output of step (3) above).
eps: single value of epsilon used for running semdedup run in step (3).
num_clusters: number of clusters that k-means clustering was run with.
output_txt_path: output txt file which contains the paths to examples that are kept after running semdedup [paths are retrieved from `path_array` from step (1)].
Citation
If you use this codebase, please make sure to cite our paper:
@article{abbas2023semdedup,
title={SemDeDup: Data-efficient learning at web-scale through semantic deduplication},
author={Abbas, Amro and Tirumala, Kushal and Simig, D{\'a}niel and Ganguli, Surya and Morcos, Ari S},
journal={arXiv preprint arXiv:2303.09540},
year={2023}
}
License
Please see the License file here.
Related Skills
node-connect
340.2kDiagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps
frontend-design
84.1kCreate distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.
openai-whisper-api
340.2kTranscribe audio via OpenAI Audio Transcriptions API (Whisper).
commit-push-pr
84.1kCommit, push, and open a PR
