BanglaParaphrase
This repository contains the code, data, and associated models of the paper titled "BanglaParaphrase: A High-Quality Bangla Paraphrase Dataset", accepted in Proceedings of the Asia-Pacific Chapter of the Association for Computational Linguistics: AACL 2022.
Table of Contents
Datasets
Disclaimer: You must agree to the license and terms of use before using the dataset.
The dataset files are organized in .jsonl format, i.e., one JSON object per line. Download the dataset from here.
One example from the test split of the dataset is given below in JSON format.
{
    "source": "খোঁজ খবর রাখতেন বিজ্ঞানের অগ্রগতি নিয়ে।",
    "target": "বিজ্ঞানের অগ্রগতির দিকে তিনি নজর রেখেছিলেন।"
}
(Both sentences roughly translate to "He kept track of the progress of science.")
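Since the dataset is one JSON object per line, it can be read without any special tooling. A minimal sketch (the file path is illustrative, not a file name from this repository):

```python
import json

def load_pairs(path):
    """Yield (source, target) pairs from a .jsonl file, one JSON object per line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # skip blank lines defensively
                continue
            record = json.loads(line)
            yield record["source"], record["target"]

# Example (assuming the test split was downloaded as "test.jsonl"):
# for source, target in load_pairs("test.jsonl"):
#     ...
```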
Data Splits
Train-dev-test example counts for the dataset are given below:

| Language | ISO 639-1 Code | Train | Validation | Test |
| -------- | -------------- | ------- | ------ | ------ |
| Bengali | bn | 419,967 | 233,31 | 233,32 |
Filtering Pipeline
The following filtering pipeline was used to preprocess the raw dataset to ensure high quality.

| Filter Name | Significance | Filtering Parameters |
| ----------- | ------------ | -------------------- |
| PINC | Ensure diversity in generated paraphrases | 0.65, 0.76, 0.80 |
| BERTScore | Preserve semantic coherence with the source | lower 0.91 - 0.93, upper 0.98 |
| N-gram repetition | Reduce n-gram repetition during inference | 2 - 4 grams |
| Punctuation | Prevent generating non-terminating sentences during inference | N/A |
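Of these filters, PINC has a simple closed form: the average, over n-gram orders 1..N, of the fraction of candidate n-grams that do not appear in the source. A minimal sketch of that standard definition (whitespace tokenization is an assumption here; the repository's scoring scripts may tokenize Bangla differently):

```python
def ngrams(tokens, n):
    """Set of n-grams of order n from a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def pinc_score(source, candidate, max_n=4):
    """PINC: average fraction of candidate n-grams NOT present in the source.
    Higher means a more lexically diverse paraphrase (0 = identical wording)."""
    src_tokens, cand_tokens = source.split(), candidate.split()
    scores = []
    for n in range(1, max_n + 1):
        cand_ngrams = ngrams(cand_tokens, n)
        if not cand_ngrams:  # candidate shorter than n tokens
            continue
        overlap = len(cand_ngrams & ngrams(src_tokens, n))
        scores.append(1.0 - overlap / len(cand_ngrams))
    return sum(scores) / len(scores) if scores else 0.0
```

Under this definition, an exact copy scores 0.0 and a paraphrase sharing no n-grams with the source scores 1.0, which is why the pipeline keeps pairs above a PINC threshold.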
Instructions for running the individual filtering and scoring scripts are provided in their respective folders.
Run the full pipeline
Install the requirements and then run the following command.
bash filter.sh -i <input> -p <pinc_threshold> -l <lower_bert_score_threshold> -h <higher_bert_score_threshold>
Here, input is the path to the .jsonl file containing sentences and their corresponding paraphrases as key-value pairs; pinc_threshold is the threshold for the PINC score; and lower_bert_score_threshold and higher_bert_score_threshold are the lower and upper BERTScore limits, on a scale of 0 to 1.
This will generate two files, source.bn and target.bn, in the working directory, containing the pairs that passed all the filtering steps.
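The two output files follow the usual parallel-corpus convention: line i of source.bn corresponds to line i of target.bn. Assuming that line alignment, they can be zipped back into pairs like so (a sketch, not a script from this repository):

```python
def load_filtered_pairs(source_path, target_path):
    """Zip the line-aligned filtered files back into (source, paraphrase) pairs."""
    with open(source_path, encoding="utf-8") as src, \
         open(target_path, encoding="utf-8") as tgt:
        return [(s.rstrip("\n"), t.rstrip("\n")) for s, t in zip(src, tgt)]

# Example:
# pairs = load_filtered_pairs("source.bn", "target.bn")
```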
Training & Evaluation
For training and evaluation, please refer to the repository of BanglaNLG.
Models
The model checkpoint from the paper is available on the Hugging Face model hub.
License
Contents of this repository are restricted to only non-commercial research purposes under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). Copyright of the dataset contents belongs to the original copyright holders.
Citation
@article{akil2022banglaparaphrase,
  title={BanglaParaphrase: A High-Quality Bangla Paraphrase Dataset},
  author={Akil, Ajwad and Sultana, Najrin and Bhattacharjee, Abhik and Shahriyar, Rifat},
  journal={arXiv preprint arXiv:2210.05109},
  year={2022}
}