BibexPy
BibexPy is a Python-based software designed to streamline bibliometric data integration, deduplication, metadata enrichment, and format conversion. It simplifies the preparation of high-quality datasets for advanced analyses by automating traditionally manual and error-prone tasks.
Academic Citation
We appreciate the academic community's interest in BibexPy. If you find our tool useful in your research work, we kindly request that you cite our paper:
APA Citation Format
Kara, B. C., Şahin, A., & Dirsehan, T. (2025). BibexPy: Harmonizing the bibliometric symphony of Scopus and Web of Science. SoftwareX, 30, 102098. https://doi.org/10.1016/j.softx.2025.102098
BibTeX Citation Format
@article{bibexpy2025,
title = {BibexPy: Harmonizing the bibliometric symphony of {Scopus} and {Web of Science}},
author = {Kara, Burak Can and {\c{S}}ahin, Alperen and Dirsehan, Ta{\c{s}}k{\i}n},
journal = {SoftwareX},
volume = {30},
pages = {102098},
year = {2025},
issn = {2352-7110},
publisher = {Elsevier},
doi = {10.1016/j.softx.2025.102098},
url = {https://www.sciencedirect.com/science/article/pii/S2352711025000652},
keywords = {Bibliometric analysis tools, Automated data integration, Metadata enrichment software, Scikit-learn, Machine learning, API-Based metadata processing}
}
IEEE Citation Format
B. C. Kara, A. Şahin and T. Dirsehan, "BibexPy: Harmonizing the bibliometric symphony of Scopus and Web of Science," SoftwareX, vol. 30, p. 102098, 2025, doi: 10.1016/j.softx.2025.102098.
Chicago Citation Format
Kara, Burak Can, Alperen Şahin, and Taşkın Dirsehan. "BibexPy: Harmonizing the Bibliometric Symphony of Scopus and Web of Science." SoftwareX 30 (2025): 102098. https://doi.org/10.1016/j.softx.2025.102098.
Features
- DOI-Based Deduplication and Merging: Identifies and removes duplicate entries while enriching metadata by merging complementary records.
- Enhanced Metadata Enrichment:
  - API-Based Enrichment: Completes missing fields using multiple APIs with detailed field statistics and API support information.
  - Machine Learning Enrichment (Experimental):
    - Currently supports prediction for:
      - Keywords (DE field)
      - Keywords Plus (ID field) - Independent model using TF-IDF + RandomForest
      - Subject Categories (SC field)
      - Web of Science Categories (WC field)
    - Shows training data statistics for each field
    - Displays progress during model training
    - Provides enrichment results summary
    - Saves detailed statistics to Excel file
  - Combined API + ML Enrichment:
    - Sequential processing combining both methods
    - API enrichment performed first with user confirmation
    - ML enrichment applied to API-enriched data
    - Comprehensive statistics for both processes
    - User confirmation at each step
    - Automatic cleanup of temporary files
    - Detailed statistics saved to Excel files
- Flexible Workflow: Choose between API or ML enrichment in any order, with clear progress indicators and statistics.
- Format Conversion: Generates outputs compatible with VosViewer, Biblioshiny, and other analysis tools.
- Command-Line Interface (CLI): Offers user-friendly interaction with minimal setup requirements.
- Comprehensive Data Processing: Handles multiple data sources and formats efficiently.
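To make the DOI-based deduplication and merging feature concrete, here is a minimal sketch of the general idea, not BibexPy's actual implementation. It uses plain dictionaries so it runs without dependencies. The function names are hypothetical, and the `DI` (DOI) and `TI` (title) field tags are assumptions borrowed from the Web of Science tag convention that the `DE`, `ID`, `SC`, and `WC` fields above follow.

```python
def normalize_doi(doi):
    """Lower-case a DOI and strip common prefixes so variants compare equal."""
    doi = (doi or "").strip().lower()
    prefixes = (
        "https://doi.org/",
        "http://doi.org/",
        "https://dx.doi.org/",
        "http://dx.doi.org/",
        "doi:",
    )
    for prefix in prefixes:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


def merge_by_doi(records):
    """Collapse records that share a DOI, filling empty fields from duplicates."""
    merged = {}       # normalized DOI -> merged record
    without_doi = []  # records with no DOI cannot be matched, so keep them as-is
    for record in records:
        key = normalize_doi(record.get("DI"))
        if not key:
            without_doi.append(dict(record))
            continue
        if key not in merged:
            merged[key] = dict(record)
            continue
        target = merged[key]
        for field, value in record.items():
            # Only fill gaps; never overwrite a value that is already present.
            if not target.get(field) and value:
                target[field] = value
    return list(merged.values()) + without_doi


# Hypothetical records: one from Scopus, one from Web of Science.
scopus = {"DI": "10.1000/XYZ123", "TI": "A study", "DE": "", "SC": "Economics"}
wos = {"DI": "https://doi.org/10.1000/xyz123", "TI": "A study",
       "DE": "bibliometrics", "SC": ""}
result = merge_by_doi([scopus, wos])
print(len(result), result[0]["DE"], result[0]["SC"])
```

In this sketch the first record seen for a DOI wins any conflict, and later duplicates only contribute values for fields that are still empty. That is one simple reading of "merging complementary records".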
Key Benefits
- Time Saving: Automates manual data cleaning and enrichment tasks
- Enhanced Data Quality: Reduces errors and inconsistencies in bibliometric data
- Flexible Integration: Works with multiple data sources and output formats
- Rich Metadata: Comprehensive metadata enrichment from multiple sources
- Smart Enrichment: Choose between API-based or ML-based enrichment methods
- Detailed Feedback: Clear statistics and progress indicators during processing
- Easy to Use: Simple command-line interface with clear instructions
Prerequisites
Required Python Version
- Python ≥ 3.9.0
Required Libraries
# Core Libraries - Required for Basic Functionality
pandas>=2.0.0 # Data manipulation and analysis
numpy>=1.24.0 # Required by pandas for numerical operations
openpyxl>=3.1.2 # Excel file handling
# Machine Learning - Required for ML Enrichment
scikit-learn>=1.3.0 # ML-based metadata enrichment and predictions
nltk>=3.8.1 # Text processing and feature extraction
# API and Network Libraries - Required for API Enrichment
requests>=2.31.0 # API interactions for metadata enrichment
urllib3>=2.0.0 # HTTP client for Python, used by requests
certifi>=2023.5.7 # Required for SSL certificate verification
python-dotenv>=1.0.0 # API configuration management
# Progress and User Interface
tqdm>=4.65.0 # Progress tracking for long operations
colorama>=0.4.6 # Console output formatting and colors
# Utilities
unidecode==1.3.6 # Text normalization and cleaning
typing-extensions>=4.7.0 # Type hints support
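Before installing, it can help to check whether the current environment already meets these prerequisites. The following sketch is not part of BibexPy. It uses only the standard library, the helper names are hypothetical, and it checks only that each package is installed, not the version pins listed above.

```python
import sys
from importlib import metadata

# A subset of the distributions listed above; extend as needed.
REQUIRED = ["pandas", "numpy", "openpyxl", "scikit-learn", "nltk",
            "requests", "tqdm", "colorama"]


def python_ok(minimum=(3, 9)):
    """Return True if the running interpreter meets the minimum version."""
    return sys.version_info[:2] >= minimum


def missing_packages(names):
    """Return the distribution names that are not installed."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


print("Python version OK:", python_ok())
print("Missing packages:", missing_packages(REQUIRED) or "none")
```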
Installation
- Clone the Repository
  git clone https://github.com/bcankara/BibexPy.git
- Navigate to the Directory
  cd BibexPy
- Install Dependencies
  pip install -r requirements.txt
- (Optional) Virtual Environment Setup
  python -m venv venv
  source venv/bin/activate # Mac/Linux
  venv\Scripts\activate # Windows
Usage
- Basic Usage
  python DataProcessor.py
  - Select your project
  - Upload Scopus (.csv) and Web of Science (.txt) files
  - Choose processing options
- Metadata Enrichment Options
The application offers three main methods for enriching your bibliometric data:
A. API-Based Enrichment
- Provides detailed statistics about empty fields
- Shows which APIs support each field
- Displays percentage of empty records for each field
- Supports multiple APIs:
  - CrossRef (Free)
  - OpenAlex (Free)
  - DataCite (Free)
  - Europe PMC (Free)
  - Scopus (API key required)
  - Semantic Scholar (Optional API key)
  - Unpaywall (Email required)
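The empty-field statistics described above can be sketched as follows. This is an illustration of the calculation only, under the assumption that a field counts as empty when it is missing, `None`, or whitespace. The function name is hypothetical and BibexPy's own report may be computed differently.

```python
def empty_field_stats(records, fields):
    """Return {field: percentage of records where the field is missing or blank}."""
    total = len(records)
    stats = {}
    for field in fields:
        empty = sum(
            1 for record in records
            if not str(record.get(field) or "").strip()
        )
        stats[field] = round(100.0 * empty / total, 1) if total else 0.0
    return stats


# Hypothetical records using the DE (keywords) and SC (subject category) tags.
records = [
    {"DE": "bibliometrics", "SC": "Economics"},
    {"DE": "", "SC": "Economics"},
    {"DE": None, "SC": ""},
    {"DE": "citation analysis", "SC": "Computer Science"},
]
print(empty_field_stats(records, ["DE", "SC"]))
```

Here `DE` is empty in two of the four records and `SC` in one, so the reported percentages are 50.0 and 25.0.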
B. Machine Learning Enrichment (Experimental)
- Currently supports prediction for:
  - Keywords (DE field)
  - Keywords Plus (ID field) - Independent model using TF-IDF + RandomForest
  - Subject Categories (SC field)
  - Web of Science Categories (WC field)
- Shows training data statistics for each field
- Displays progress during model training
- Provides enrichment results summary
- Saves detailed statistics to Excel file
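As a rough illustration of the TF-IDF + RandomForest approach named above, the sketch below trains a scikit-learn pipeline on records whose label is known and predicts the label for the rest. It is a simplified assumption, not BibexPy's model. It treats the field as a single label per record, whereas real keyword fields hold several values. The function name and sample data are hypothetical.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline


def fill_missing_labels(texts, labels, random_state=0):
    """Predict labels for entries whose label is empty, training on the rest."""
    known = [(text, label) for text, label in zip(texts, labels) if label]
    if len({label for _, label in known}) < 2:
        return list(labels)  # not enough classes to train a classifier

    model = make_pipeline(
        TfidfVectorizer(),
        RandomForestClassifier(n_estimators=100, random_state=random_state),
    )
    model.fit([text for text, _ in known], [label for _, label in known])

    filled = list(labels)
    missing = [i for i, label in enumerate(labels) if not label]
    if missing:
        predictions = model.predict([texts[i] for i in missing])
        for i, prediction in zip(missing, predictions):
            filled[i] = str(prediction)
    return filled


# Hypothetical titles with a subject category (SC) that is partly missing.
texts = [
    "stock market volatility and returns",
    "monetary policy and inflation targeting",
    "deep neural network image classification",
    "convolutional network object detection",
    "inflation and interest rate policy",
    "image segmentation with neural networks",
]
labels = ["Economics", "Economics", "Computer Science", "Computer Science", "", ""]
print(fill_missing_labels(texts, labels))
```

With so little training data the predictions are not meaningful. The point is only the shape of the workflow: vectorize the text, fit on labeled records, and fill the gaps.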
C. Combined API + ML Enrichment
- Sequential processing combining both methods
- API enrichment performed first with user confirmation
- ML enrichment applied to API-enriched data
- Comprehensive statistics for both processes
- User confirmation at each step
- Automatic cleanup of temporary files
- Detailed statistics saved to Excel files
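The sequential flow of the combined method can be pictured with the small sketch below. It is a schematic under stated assumptions, not BibexPy's code. `run_combined`, the stub steps, and the `confirm` callback are all hypothetical names, and each step is assumed to return the enriched records together with its own statistics.

```python
def run_combined(records, api_step, ml_step, confirm):
    """Run API enrichment, then ML enrichment, asking for confirmation before each."""
    stats = {}
    if confirm("Run API enrichment?"):
        records, stats["api"] = api_step(records)
    if confirm("Run ML enrichment on the API-enriched data?"):
        records, stats["ml"] = ml_step(records)
    return records, stats


def stub_api_step(records):
    """Stand-in for API enrichment: fills an empty SC field with a fixed value."""
    filled = sum(1 for r in records if not r.get("SC"))
    enriched = [dict(r, SC=r.get("SC") or "Economics") for r in records]
    return enriched, {"filled": filled}


def stub_ml_step(records):
    """Stand-in for ML enrichment: fills an empty DE field with a fixed value."""
    filled = sum(1 for r in records if not r.get("DE"))
    enriched = [dict(r, DE=r.get("DE") or "predicted keyword") for r in records]
    return enriched, {"filled": filled}


# Answer "yes" to every prompt; an interactive tool would ask the user instead.
records, stats = run_combined(
    [{"SC": "", "DE": ""}, {"SC": "Physics", "DE": "optics"}],
    stub_api_step,
    stub_ml_step,
    confirm=lambda prompt: True,
)
print(records)
print(stats)
```

Passing the confirmation as a callback keeps the pipeline testable: a command-line tool can pass a function that wraps `input()`, while a script can approve or skip steps automatically.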
- API Configuration
  For API-based enrichment, configure your APIs in API_config.json:
  { "scopus": { "api_key": "YOUR-SCOPUS-API-KEY", "description": "Get your API key from https://dev.elsevier
