<p align="center"> <img src="https://bibexpy.com/bibexpy_logo.webp" alt="BibexPy" width="250"/> </p> <h3 align="center">Harmonizing the Bibliometric Symphony of Scopus and Web of Science</h3> <p align="center"> <a href="https://www.python.org"> <img src="https://img.shields.io/badge/Python-≥3.9-blue.svg?logo=python&logoColor=white" alt="Python"/> </a> <a href="LICENSE"> <img src="https://img.shields.io/badge/License-GPL-green.svg" alt="License"/> </a> <a href="http://bibexpy.com/doc"> <img src="https://img.shields.io/badge/docs-latest-brightgreen.svg" alt="Documentation"/> </a> <a href="https://github.com/bcankara/BibexPy/issues"> <img src="https://img.shields.io/github/issues/bcankara/BibexPy.svg" alt="GitHub Issues"/> </a> <a href="https://github.com/bcankara/BibexPy/releases"> <img src="https://img.shields.io/github/downloads/bcankara/BibexPy/total.svg" alt="Downloads"/> </a> </p> <p align="center"> <a href="http://bibexpy.com/doc">Documentation</a> • <a href="#installation">Installation</a> • <a href="#features">Features</a> • <a href="#usage">Usage</a> • <a href="#support-and-community">Support</a> </p>

Academic Citation

We appreciate the academic community's interest in BibexPy. If you find our tool useful in your research work, we kindly request that you cite our paper:

APA Citation Format

Kara, B. C., Şahin, A., & Dirsehan, T. (2025). BibexPy: Harmonizing the bibliometric symphony of Scopus and Web of Science. SoftwareX, 30, 102098. https://doi.org/10.1016/j.softx.2025.102098

BibTeX Citation Format

```
@article{bibexpy2025,
    title     = {BibexPy: Harmonizing the bibliometric symphony of {Scopus} and {Web of Science}},
    author    = {Kara, Burak Can and {\c{S}}ahin, Alperen and Dirsehan, Ta{\c{s}}k{\i}n},
    journal   = {SoftwareX},
    volume    = {30},
    pages     = {102098},
    year      = {2025},
    issn      = {2352-7110},
    publisher = {Elsevier},
    doi       = {10.1016/j.softx.2025.102098},
    url       = {https://www.sciencedirect.com/science/article/pii/S2352711025000652},
    keywords  = {Bibliometric analysis tools, Automated data integration, Metadata enrichment software, Scikit-learn, Machine learning, API-Based metadata processing}
}
```

IEEE Citation Format

B. C. Kara, A. Şahin and T. Dirsehan, "BibexPy: Harmonizing the bibliometric symphony of Scopus and Web of Science," SoftwareX, vol. 30, p. 102098, 2025, doi: 10.1016/j.softx.2025.102098.

Chicago Citation Format

Kara, Burak Can, Alperen Şahin, and Taşkın Dirsehan. "BibexPy: Harmonizing the Bibliometric Symphony of Scopus and Web of Science." SoftwareX 30 (2025): 102098. https://doi.org/10.1016/j.softx.2025.102098.

BibexPy is a Python-based tool that streamlines bibliometric data integration, deduplication, metadata enrichment, and format conversion. It automates traditionally manual, error-prone steps so that high-quality datasets can be prepared for advanced analyses with less effort.

Features

- DOI-Based Deduplication and Merging: Identifies and removes duplicate entries while enriching metadata by merging complementary records.
- Enhanced Metadata Enrichment:
  - API-Based Enrichment: Completes missing fields using multiple APIs, with detailed field statistics and API support information.
  - Machine Learning Enrichment (Experimental):
    - Currently supports prediction for:
      - Keywords (DE field)
      - Keywords Plus (ID field) - Independent model using TF-IDF + RandomForest
      - Subject Categories (SC field)
      - Web of Science Categories (WC field)
    - Shows training data statistics for each field
    - Displays progress during model training
    - Provides enrichment results summary
    - Saves detailed statistics to Excel file
  - Combined API + ML Enrichment:
    - Sequential processing combining both methods
    - API enrichment performed first with user confirmation
    - ML enrichment applied to API-enriched data
    - Comprehensive statistics for both processes
    - User confirmation at each step
    - Automatic cleanup of temporary files
    - Detailed statistics saved to Excel files
- Flexible Workflow: Choose between API or ML enrichment in any order, with clear progress indicators and statistics.
- Format Conversion: Generates outputs compatible with VOSviewer, Biblioshiny, and other analysis tools.
- Command-Line Interface (CLI): Offers user-friendly interaction with minimal setup requirements.
- Comprehensive Data Processing: Handles multiple data sources and formats efficiently.
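
The DOI-based deduplication and merging listed above can be illustrated with a minimal sketch. This is not BibexPy's own implementation: the function names are hypothetical, records are plain dictionaries, and the DI (DOI) and TI (title) field tags are assumed here following Web of Science naming. Records that share a normalized DOI are collapsed into one, and empty fields are filled from the complementary record.

```python
def normalize_doi(doi):
    """Lower-case a DOI and strip common prefixes so variants compare equal."""
    if not doi:
        return None
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/",
                   "https://dx.doi.org/", "http://dx.doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi or None


def merge_by_doi(records):
    """Collapse records sharing a DOI, keeping the first non-empty value per field."""
    merged = {}          # normalized DOI -> merged record
    without_doi = []     # records that cannot be matched by DOI are kept as-is
    for record in records:
        key = normalize_doi(record.get("DI"))
        if key is None:
            without_doi.append(dict(record))
            continue
        target = merged.setdefault(key, {})
        for field, value in record.items():
            if value and not target.get(field):
                target[field] = value
    return list(merged.values()) + without_doi


# Hypothetical example: the same paper exported from two databases.
scopus_row = {"DI": "10.1000/xyz", "TI": "A study", "DE": ""}
wos_row = {"DI": "https://doi.org/10.1000/XYZ", "TI": "A study", "ID": "BIBLIOMETRICS"}
no_doi_row = {"DI": "", "TI": "No DOI available"}

result = merge_by_doi([scopus_row, wos_row, no_doi_row])
print(len(result))  # two records remain: one merged, one without a DOI
```

Keeping the first non-empty value is only one possible merge policy; a real tool may prefer one database over the other for specific fields.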

Key Benefits

- Time Saving: Automates manual data cleaning and enrichment tasks
- Enhanced Data Quality: Reduces errors and inconsistencies in bibliometric data
- Flexible Integration: Works with multiple data sources and output formats
- Rich Metadata: Comprehensive metadata enrichment from multiple sources
- Smart Enrichment: Choose between API-based or ML-based enrichment methods
- Detailed Feedback: Clear statistics and progress indicators during processing
- Easy to Use: Simple command-line interface with clear instructions

Prerequisites

Required Python Version

Python ≥ 3.9.0

Required Libraries

```
# Core Libraries - Required for Basic Functionality
pandas>=2.0.0          # Data manipulation and analysis
numpy>=1.24.0          # Required by pandas for numerical operations
openpyxl>=3.1.2        # Excel file handling

# Machine Learning - Required for ML Enrichment
scikit-learn>=1.3.0    # ML-based metadata enrichment and predictions
nltk>=3.8.1            # Text processing and feature extraction

# API and Network Libraries - Required for API Enrichment
requests>=2.31.0       # API interactions for metadata enrichment
urllib3>=2.0.0         # HTTP client for Python, used by requests
certifi>=2023.5.7      # Required for SSL certificate verification
python-dotenv>=1.0.0   # API configuration management

# Progress and User Interface
tqdm>=4.65.0           # Progress tracking for long operations
colorama>=0.4.6        # Console output formatting and colors

# Utilities
unidecode==1.3.6       # Text normalization and cleaning
typing-extensions>=4.7.0  # Type hints support
```
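
To check an environment against the prerequisites above before running the tool, something like the following sketch may help. It is not part of BibexPy; the helper names are hypothetical, and it only tests whether each distribution is installed, not whether the installed version satisfies the listed constraint.

```python
import sys
from importlib import metadata

# Distribution names taken from the requirements listed above.
REQUIRED = [
    "pandas", "numpy", "openpyxl", "scikit-learn", "nltk",
    "requests", "urllib3", "certifi", "python-dotenv",
    "tqdm", "colorama", "unidecode", "typing-extensions",
]


def python_ok(minimum=(3, 9)):
    """Return True if the running interpreter is at least the given version."""
    return sys.version_info >= minimum


def missing_packages(names):
    """Return the distributions from `names` that are not installed."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


print("Python version OK:", python_ok())
print("Missing packages:", missing_packages(REQUIRED))
```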

Installation

Clone the Repository

```
git clone https://github.com/bcankara/BibexPy.git
```

Navigate to the Directory
```
cd BibexPy
```
Install Dependencies
```
pip install -r requirements.txt
```

(Optional) Virtual Environment Setup

```
python -m venv venv
source venv/bin/activate  # Mac/Linux
venv\Scripts\activate     # Windows
```

Usage

Basic Usage
```
python DataProcessor.py
```
- Select your project
- Upload Scopus (.csv) and Web of Science (.txt) files
- Choose processing options

Metadata Enrichment Options

The application offers three main methods for enriching your bibliometric data:

A. API-Based Enrichment
- Provides detailed statistics about empty fields
- Shows which APIs support each field
- Displays percentage of empty records for each field
- Supports multiple APIs:
  - CrossRef (Free)
  - OpenAlex (Free)
  - DataCite (Free)
  - Europe PMC (Free)
  - Scopus (API key required)
  - Semantic Scholar (Optional API key)
  - Unpaywall (Email required)
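
The empty-field statistics described above could be computed roughly as follows. This is an illustrative sketch rather than BibexPy's code: the function name is hypothetical, records are plain dictionaries, and the AB (abstract) tag is assumed alongside the DE tag used elsewhere in this README.

```python
def empty_field_stats(records, fields):
    """Count empty values per field and express them as a percentage of all records."""
    total = len(records)
    stats = {}
    for field in fields:
        # A value counts as empty if it is missing, None, or only whitespace.
        empty = sum(1 for r in records if not str(r.get(field) or "").strip())
        percent = round(100 * empty / total, 1) if total else 0.0
        stats[field] = {"empty": empty, "percent": percent}
    return stats


# Hypothetical sample of four records.
sample = [
    {"AB": "An abstract", "DE": ""},
    {"AB": None, "DE": "bibliometrics"},
    {"AB": "   ", "DE": "scopus"},
    {"AB": "Another abstract", "DE": "web of science"},
]

for field, info in empty_field_stats(sample, ["AB", "DE"]).items():
    print(f"{field}: {info['empty']} empty ({info['percent']}%)")
```

Fields with a high share of empty records are the natural candidates for API enrichment.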

B. Machine Learning Enrichment (Experimental)
- Currently supports prediction for:
  - Keywords (DE field)
  - Keywords Plus (ID field) - Independent model using TF-IDF + RandomForest
  - Subject Categories (SC field)
  - Web of Science Categories (WC field)
- Shows training data statistics for each field
- Displays progress during model training
- Provides enrichment results summary
- Saves detailed statistics to Excel file
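
To make the ML enrichment idea above concrete, here is a minimal scikit-learn sketch that trains on records where a field is present and predicts it where it is missing. It is a simplification under stated assumptions, not BibexPy's model: this README names TF-IDF + RandomForest only for the ID field, whereas the sketch applies that combination to the SC field, treats SC as a single label (real values are often multi-valued), and uses an assumed TI (title) field as the only input.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy records; the last one has an empty SC field to be predicted.
records = [
    {"TI": "Deep learning for image classification", "SC": "Computer Science"},
    {"TI": "Neural networks in machine translation", "SC": "Computer Science"},
    {"TI": "Clinical trial of a new cancer therapy", "SC": "Medicine"},
    {"TI": "Patient outcomes after cardiac surgery", "SC": "Medicine"},
    {"TI": "Convolutional neural networks for image retrieval", "SC": ""},
]

labeled = [r for r in records if r["SC"]]
unlabeled = [r for r in records if not r["SC"]]

# Training-data statistics: number of labeled examples per class.
training_counts = {}
for r in labeled:
    training_counts[r["SC"]] = training_counts.get(r["SC"], 0) + 1

# TF-IDF turns titles into sparse vectors; the random forest classifies them.
model = make_pipeline(
    TfidfVectorizer(),
    RandomForestClassifier(n_estimators=50, random_state=0),
)
model.fit([r["TI"] for r in labeled], [r["SC"] for r in labeled])

# Fill the missing field with the predicted label.
predictions = model.predict([r["TI"] for r in unlabeled])
for record, predicted in zip(unlabeled, predictions):
    record["SC"] = str(predicted)

print(training_counts)
print(f"Filled {len(unlabeled)} of {len(records)} records")
```

With so few training examples the prediction is not meaningful; the point is only the train-on-complete, predict-on-missing pattern.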

C. Combined API + ML Enrichment
- Sequential processing combining both methods
- API enrichment performed first with user confirmation
- ML enrichment applied to API-enriched data
- Comprehensive statistics for both processes
- User confirmation at each step
- Automatic cleanup of temporary files
- Detailed statistics saved to Excel files
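
The sequential flow of the combined mode can be sketched as below. This is an assumption-laden illustration, not the tool's actual control flow: run_combined, the step callables, and the confirm callback are all hypothetical names, and the intermediate result is checkpointed to a JSON temp file purely to demonstrate the cleanup step.

```python
import json
import os
import tempfile


def run_combined(records, api_step, ml_step, confirm):
    """Run API enrichment, then ML enrichment, asking for confirmation before each."""
    temp_paths = []
    try:
        if confirm("Run API enrichment?"):
            records = api_step(records)
            # Checkpoint the API-enriched data in a temporary file.
            fd, path = tempfile.mkstemp(suffix=".json")
            with os.fdopen(fd, "w", encoding="utf-8") as handle:
                json.dump(records, handle)
            temp_paths.append(path)
        if confirm("Run ML enrichment on the current data?"):
            # ML enrichment sees whatever the API step produced.
            records = ml_step(records)
        return records
    finally:
        # Temporary files are removed whether or not the steps succeeded.
        for path in temp_paths:
            if os.path.exists(path):
                os.remove(path)


# Stand-in steps for demonstration; real steps would call APIs or models.
def fake_api(records):
    return [{**r, "DE": r.get("DE") or "from-api"} for r in records]


def fake_ml(records):
    return [{**r, "SC": r.get("SC") or "predicted"} for r in records]


enriched = run_combined(
    [{"TI": "A", "DE": "", "SC": ""}],
    fake_api,
    fake_ml,
    confirm=lambda prompt: True,  # an interactive CLI would ask the user here
)
print(enriched)
```

Passing the confirmation function as a parameter keeps the flow testable without interactive input.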

API Configuration

For API-based enrichment, configure your APIs in API_config.json:

{
    "scopus": {
        "api_key": "YOUR-SCOPUS-API-KEY",
        "description": "Get your API key from https://dev.elsevier

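A configuration file of this shape could be read as in the sketch below. It assumes only what is visible above, namely a top-level service entry such as scopus with an api_key field; the function name load_api_key and the treatment of values starting with YOUR- as unset placeholders are illustrative assumptions, not BibexPy behavior.

```python
import json

# Assumption: values still starting with this prefix are unfilled placeholders.
PLACEHOLDER_PREFIX = "YOUR-"


def load_api_key(config_path, service):
    """Return the API key for `service`, or None if it is missing or a placeholder."""
    with open(config_path, encoding="utf-8") as handle:
        config = json.load(handle)
    key = config.get(service, {}).get("api_key", "")
    if not key or key.startswith(PLACEHOLDER_PREFIX):
        return None
    return key
```

Calling load_api_key("API_config.json", "scopus") would then return the configured key, or None while the placeholder value is still in place.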