BibexPy

BibexPy is a Python-based software designed to streamline bibliometric data integration, deduplication, metadata enrichment, and format conversion. It simplifies the preparation of high-quality datasets for advanced analyses by automating traditionally manual and error-prone tasks.

<p align="center"> <img src="https://bibexpy.com/bibexpy_logo.webp" alt="BibexPy" width="250"/> </p> <h3 align="center">Harmonizing the Bibliometric Symphony of Scopus and Web of Science</h3> <p align="center"> <a href="https://www.python.org"> <img src="https://img.shields.io/badge/Python-≥3.9-blue.svg?logo=python&logoColor=white" alt="Python"/> </a> <a href="LICENSE"> <img src="https://img.shields.io/badge/License-GPL-green.svg" alt="License"/> </a> <a href="http://bibexpy.com/doc"> <img src="https://img.shields.io/badge/docs-latest-brightgreen.svg" alt="Documentation"/> </a> <a href="https://github.com/bcankara/BibexPy/issues"> <img src="https://img.shields.io/github/issues/bcankara/BibexPy.svg" alt="GitHub Issues"/> </a> <a href="https://github.com/bcankara/BibexPy/releases"> <img src="https://img.shields.io/github/downloads/bcankara/BibexPy/total.svg" alt="Downloads"/> </a> </p> <p align="center"> <a href="http://bibexpy.com/doc">Documentation</a> • <a href="#installation">Installation</a> • <a href="#features">Features</a> • <a href="#usage">Usage</a> • <a href="#support-and-community">Support</a> </p>

Academic Citation

We appreciate the academic community's interest in BibexPy. If you find our tool useful in your research work, we kindly request that you cite our paper:

APA Citation Format

Kara, B. C., Şahin, A., & Dirsehan, T. (2025). BibexPy: Harmonizing the bibliometric symphony of Scopus and Web of Science. SoftwareX, 30, 102098. https://doi.org/10.1016/j.softx.2025.102098

BibTeX Citation Format

@article{bibexpy2025,
    title     = {BibexPy: Harmonizing the bibliometric symphony of {Scopus} and {Web of Science}},
    author    = {Kara, Burak Can and {\c{S}}ahin, Alperen and Dirsehan, Ta{\c{s}}k{\i}n},
    journal   = {SoftwareX},
    volume    = {30},
    pages     = {102098},
    year      = {2025},
    issn      = {2352-7110},
    publisher = {Elsevier},
    doi       = {10.1016/j.softx.2025.102098},
    url       = {https://www.sciencedirect.com/science/article/pii/S2352711025000652},
    keywords  = {Bibliometric analysis tools, Automated data integration, Metadata enrichment software, Scikit-learn, Machine learning, API-Based metadata processing}
}

IEEE Citation Format

B. C. Kara, A. Şahin and T. Dirsehan, "BibexPy: Harmonizing the bibliometric symphony of Scopus and Web of Science," SoftwareX, vol. 30, p. 102098, 2025, doi: 10.1016/j.softx.2025.102098.

Chicago Citation Format

Kara, Burak Can, Alperen Şahin, and Taşkın Dirsehan. "BibexPy: Harmonizing the Bibliometric Symphony of Scopus and Web of Science." SoftwareX 30 (2025): 102098. https://doi.org/10.1016/j.softx.2025.102098.

Tech Stack

Python, Pandas, NumPy, scikit-learn, NLTK, Excel

Scopus, Web of Science, VOSviewer

Features

  • DOI-Based Deduplication and Merging: Identifies and removes duplicate entries while enriching metadata by merging complementary records.
  • Enhanced Metadata Enrichment:
    • API-Based Enrichment: Completes missing fields using multiple APIs with detailed field statistics and API support information.
    • Machine Learning Enrichment (Experimental):
      • Currently supports prediction for:
        • Keywords (DE field)
        • Keywords Plus (ID field) - Independent model using TF-IDF + RandomForest
        • Subject Categories (SC field)
        • Web of Science Categories (WC field)
      • Shows training data statistics for each field
      • Displays progress during model training
      • Provides enrichment results summary
      • Saves detailed statistics to Excel file
    • Combined API + ML Enrichment:
      • Sequential processing combining both methods
      • API enrichment performed first with user confirmation
      • ML enrichment applied to API-enriched data
      • Comprehensive statistics for both processes
      • User confirmation at each step
      • Automatic cleanup of temporary files
      • Detailed statistics saved to Excel files
  • Flexible Workflow: Run API and ML enrichment separately or in either order, with clear progress indicators and statistics.
  • Format Conversion: Generates outputs compatible with VOSviewer, Biblioshiny, and other analysis tools.
  • Command-Line Interface (CLI): Offers user-friendly interaction with minimal setup requirements.
  • Comprehensive Data Processing: Handles multiple data sources and formats efficiently.
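
The DOI-based deduplication and merging feature listed above can be illustrated with a minimal sketch. This is not BibexPy's actual implementation: the function names `normalize_doi` and `merge_by_doi` are hypothetical, the records are assumed to be plain dictionaries keyed by Web of Science style field tags (`DI` for DOI, `TI` for title), and the DOI in the example is made up.

```python
def normalize_doi(doi):
    """Lowercase a DOI and strip common URL/prefix forms so variants compare equal."""
    if not doi:
        return ""
    doi = doi.strip().lower()
    prefixes = (
        "https://doi.org/",
        "http://doi.org/",
        "https://dx.doi.org/",
        "http://dx.doi.org/",
        "doi:",
    )
    for prefix in prefixes:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi.strip()


def merge_by_doi(records):
    """Collapse records that share a DOI, filling empty fields from later duplicates.

    Records without a DOI are kept unchanged because they cannot be matched here.
    """
    merged = {}   # normalized DOI -> combined record
    no_doi = []
    for rec in records:
        key = normalize_doi(rec.get("DI"))
        if not key:
            no_doi.append(dict(rec))
            continue
        if key not in merged:
            merged[key] = dict(rec)
            continue
        target = merged[key]
        for field, value in rec.items():
            # Only fill gaps; never overwrite a value that is already present.
            if value and not target.get(field):
                target[field] = value
    return list(merged.values()) + no_doi


# Hypothetical example: the same paper exported from two databases.
scopus_record = {"DI": "10.1000/XYZ123", "TI": "A study", "DE": "bibliometrics"}
wos_record = {"DI": "https://doi.org/10.1000/xyz123", "TI": "A study",
              "DE": "", "WC": "Information Science"}
result = merge_by_doi([scopus_record, wos_record])
```

In this sketch the first record seen for a DOI wins on conflicts, and later duplicates only contribute fields that are still empty, which is one simple way to merge complementary records.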

Key Benefits

  • Time Saving: Automates manual data cleaning and enrichment tasks
  • Enhanced Data Quality: Reduces errors and inconsistencies in bibliometric data
  • Flexible Integration: Works with multiple data sources and output formats
  • Rich Metadata: Comprehensive metadata enrichment from multiple sources
  • Smart Enrichment: Choose between API-based and ML-based enrichment methods
  • Detailed Feedback: Clear statistics and progress indicators during processing
  • Easy to Use: Simple command-line interface with clear instructions

Prerequisites

Required Python Version

  • Python ≥ 3.9.0

Required Libraries

# Core Libraries - Required for Basic Functionality
pandas>=2.0.0          # Data manipulation and analysis
numpy>=1.24.0          # Required by pandas for numerical operations
openpyxl>=3.1.2        # Excel file handling

# Machine Learning - Required for ML Enrichment
scikit-learn>=1.3.0    # ML-based metadata enrichment and predictions
nltk>=3.8.1            # Text processing and feature extraction

# API and Network Libraries - Required for API Enrichment
requests>=2.31.0       # API interactions for metadata enrichment
urllib3>=2.0.0         # HTTP client for Python, used by requests
certifi>=2023.5.7      # Required for SSL certificate verification
python-dotenv>=1.0.0   # API configuration management

# Progress and User Interface
tqdm>=4.65.0           # Progress tracking for long operations
colorama>=0.4.6        # Console output formatting and colors

# Utilities
unidecode==1.3.6       # Text normalization and cleaning
typing-extensions>=4.7.0  # Type hints support

Installation

  1. Clone the Repository

    git clone https://github.com/bcankara/BibexPy.git
    
  2. Navigate to the Directory

    cd BibexPy
    
  3. Install Dependencies

    pip install -r requirements.txt
    
  4. (Optional) Virtual Environment Setup (create and activate it before step 3 so the dependencies install into the isolated environment)

    python -m venv venv
    source venv/bin/activate  # Mac/Linux
    venv\Scripts\activate     # Windows
    

Usage

  1. Basic Usage

    python DataProcessor.py
    
    • Select your project
    • Upload Scopus (.csv) and Web of Science (.txt) files
    • Choose processing options
  2. Metadata Enrichment Options

    The application offers three main methods for enriching your bibliometric data:

    A. API-Based Enrichment

    • Provides detailed statistics about empty fields
    • Shows which APIs support each field
    • Displays percentage of empty records for each field
    • Supports multiple APIs:
      • CrossRef (Free)
      • OpenAlex (Free)
      • DataCite (Free)
      • Europe PMC (Free)
      • Scopus (API key required)
      • Semantic Scholar (Optional API key)
      • Unpaywall (Email required)
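
The empty-field percentages mentioned above can be computed along the following lines. This is a rough sketch rather than the tool's own code: `empty_field_stats` is a hypothetical helper, and records are assumed to be dictionaries keyed by field tag, where a missing key, `None`, or a blank string all count as empty.

```python
def empty_field_stats(records, fields):
    """Return {field: percentage of records where the field is missing or blank}."""
    total = len(records)
    stats = {}
    for field in fields:
        empty = sum(
            1 for rec in records
            if not str(rec.get(field) or "").strip()
        )
        stats[field] = round(100.0 * empty / total, 1) if total else 0.0
    return stats


# Hypothetical sample data.
records = [
    {"TI": "Paper A", "AB": "Some abstract", "DE": "bibliometrics"},
    {"TI": "Paper B", "AB": "", "DE": ""},
    {"TI": "Paper C", "AB": "Another abstract", "DE": None},
    {"TI": "Paper D", "AB": "Third abstract"},
]
stats = empty_field_stats(records, ["TI", "AB", "DE"])
```

A report like this helps decide which fields are worth sending to an API before any requests are made.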

    B. Machine Learning Enrichment (Experimental)

    • Currently supports prediction for:
      • Keywords (DE field)
      • Keywords Plus (ID field) - Independent model using TF-IDF + RandomForest
      • Subject Categories (SC field)
      • Web of Science Categories (WC field)
    • Shows training data statistics for each field
    • Displays progress during model training
    • Provides enrichment results summary
    • Saves detailed statistics to Excel file
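
To make the "TF-IDF + RandomForest" idea concrete, here is a minimal scikit-learn sketch of multi-label keyword prediction. It is an illustration under assumptions, not BibexPy's actual model: the function names, the hyperparameters, and the tiny training set are all hypothetical, and the text used as input (for example title plus abstract) is a guess at what such a model would be trained on.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer


def train_keyword_model(texts, keyword_lists, random_state=0):
    """Fit a TF-IDF + RandomForest pipeline that maps text to a set of keywords."""
    binarizer = MultiLabelBinarizer()
    # One 0/1 column per distinct keyword seen in the training data.
    targets = binarizer.fit_transform(keyword_lists)
    model = make_pipeline(
        TfidfVectorizer(stop_words="english"),
        RandomForestClassifier(n_estimators=100, random_state=random_state),
    )
    model.fit(texts, targets)
    return model, binarizer


def predict_keywords(model, binarizer, texts):
    """Return one tuple of predicted keywords per input text."""
    return binarizer.inverse_transform(model.predict(texts))


# Hypothetical training data: records that already have keywords.
train_texts = [
    "citation network analysis of journals",
    "co-citation analysis and science mapping",
    "deep learning for image classification",
    "neural networks for image recognition",
]
train_keywords = [
    ["bibliometrics", "citation analysis"],
    ["bibliometrics", "science mapping"],
    ["machine learning", "computer vision"],
    ["machine learning", "neural networks"],
]
model, binarizer = train_keyword_model(train_texts, train_keywords)
predictions = predict_keywords(model, binarizer, ["citation analysis of journals"])
```

Records that already carry keywords serve as training data, and the fitted model is then applied to records where the field is empty. With a corpus this small the predictions are not meaningful; the point is only the shape of the pipeline.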

    C. Combined API + ML Enrichment

    • Sequential processing combining both methods
    • API enrichment performed first with user confirmation
    • ML enrichment applied to API-enriched data
    • Comprehensive statistics for both processes
    • User confirmation at each step
    • Automatic cleanup of temporary files
    • Detailed statistics saved to Excel files
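
The sequential flow described above (API step first, ML step second, confirmation before each, temporary files cleaned up afterwards) could be structured roughly as follows. This is a sketch under stated assumptions: `run_combined`, `fake_api`, and `fake_ml` are hypothetical names, and the two step functions are stand-ins for real enrichment logic.

```python
import json
import tempfile
from pathlib import Path


def run_combined(records, steps, confirm):
    """Run enrichment steps in order, calling `confirm(name)` before each one.

    Intermediate snapshots go to a temporary directory that is deleted
    automatically when the function returns.
    """
    summary = []
    with tempfile.TemporaryDirectory() as tmp:
        for name, step in steps:
            if not confirm(name):
                summary.append((name, "skipped"))
                continue
            records = step(records)
            # Snapshot the intermediate result, as a long-running tool might.
            Path(tmp, f"{name}.json").write_text(json.dumps(records), encoding="utf-8")
            summary.append((name, "done"))
    return records, summary


def fake_api(recs):
    """Stand-in for API enrichment: fill empty abstracts."""
    return [dict(r, AB=r.get("AB") or "abstract from API") for r in recs]


def fake_ml(recs):
    """Stand-in for ML enrichment: fill empty keywords."""
    return [dict(r, DE=r.get("DE") or "predicted keyword") for r in recs]


enriched, summary = run_combined(
    [{"TI": "A study", "AB": "", "DE": ""}],
    [("api", fake_api), ("ml", fake_ml)],
    confirm=lambda name: True,
)
```

In an interactive CLI, `confirm` could wrap `input()` so the user approves or skips each stage; passing it as a parameter keeps the flow testable without a terminal.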
  3. API Configuration

    For API-based enrichment, configure your APIs in API_config.json:

    {
        "scopus": {
            "api_key": "YOUR-SCOPUS-API-KEY",
            "description": "Get your API key from https://dev.elsevier
    
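
A configuration file shaped like the fragment above could be read with a few lines of standard-library Python. This is a hedged sketch, not the project's loader: `load_api_key` is a hypothetical helper, the `example_service` entry and its key are invented for the demonstration, and only the `scopus` / `api_key` structure and the `YOUR-` placeholder come from the fragment shown.

```python
import json
import tempfile
from pathlib import Path


def load_api_key(config_path, service):
    """Return the `api_key` for `service` from the JSON config, or None if unset."""
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    key = config.get(service, {}).get("api_key", "")
    # Treat the template placeholder as "not configured".
    if not key or key.startswith("YOUR-"):
        return None
    return key


# Demonstration with a throwaway config file.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp, "API_config.json")
    demo_config = {
        "scopus": {"api_key": "YOUR-SCOPUS-API-KEY"},
        "example_service": {"api_key": "abc123"},  # hypothetical entry
    }
    path.write_text(json.dumps(demo_config), encoding="utf-8")
    placeholder_result = load_api_key(path, "scopus")
    configured_result = load_api_key(path, "example_service")
    missing_result = load_api_key(path, "not_in_file")
```

Returning `None` for an unedited placeholder lets the caller skip that API instead of sending a request that is bound to be rejected.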