
dataFishing

dataFishing is a Python tool that automates searches of genomic and biodiversity databases. It streamlines the retrieval of DNA sequences, common names, synonyms, conservation status, and species occurrence data, and is described as faster and more efficient than comparable R packages.


<p align="center"> <img src="https://raw.githubusercontent.com/luanrabelo/dataFishing/stable/docs/assets/dataFishing.png" alt="dataFishing Logo" width="50%"> </p> <p align="center"> <a href="https://www.buymeacoffee.com/lprabelo" target="_blank"> <img src="https://img.buymeacoffee.com/button-api/?text=Buy me a coffee&emoji=☕&slug=lprabelo&button_colour=959595&font_colour=000000&font_family=Lato&outline_colour=000000&coffee_colour=000000" /> </a> </p>


Contents Overview


System Overview

:rocket: Go to Contents Overview
<p align="center"> <img src="https://raw.githubusercontent.com/luanrabelo/dataFishing/stable/docs/assets/dataFishing.png" alt="dataFishing Logo" width="15%"> </p>

dataFishing is a Python tool with a user-friendly web form for mining mitochondrial and chloroplast sequences and biodiversity data. It automates access to six databases: NCBI GenBank, BOLD Systems, GBIF, WoRMS, the IUCN Red List, and Eschmeyer's Catalog of Fishes. For each species it can retrieve DNA sequences, common names, synonyms, conservation status, and occurrence points, and it is designed to obtain taxonomic information from these databases faster than other tools. The dataFishing repository is hosted on GitHub under the MIT License and is freely available to the scientific community.

Key Features

  • 🌍 Multiple Database Support: Access 6 major biodiversity databases
  • 🧬 Sequence Download: Automated download of mitochondrial and chloroplast sequences
  • 📊 Performance Benchmarking: Built-in performance analysis and visualization
  • Asynchronous Processing: High-speed concurrent API requests
  • 📋 Comprehensive Results: Excel, CSV, and TSV output formats
  • 🔧 Easy Configuration: Simple command-line interface with helpful documentation


How to cite dataFishing


If you use dataFishing in academic or professional work, please cite:

Rabelo, L., Sodré, D., Balcázar, O. D. A., do Rosário, M. F., Guimarães-Costa, A. J., Gomes, G., Sampaio, I., & Vallinoto, M. (2025). dataFishing: An efficient Python tool and user-friendly web-form for mining mitochondrial and chloroplast sequences, taxonomic, and biodiversity data. Ecological Informatics, 85, 102970. https://doi.org/10.1016/j.ecoinf.2024.102970

License

dataFishing is released under the MIT License. This license permits reuse within proprietary software provided that all copies of the licensed software include a copy of the MIT License terms and the copyright notice.

For more details, please see the MIT License.


The Hitchhiker's Guide to dataFishing

Change Log

  • Version 1.6.1 (2025-01-30)

    • Added asynchronous processing with aiohttp for improved performance
    • Implemented comprehensive IUCN Red List data extraction
    • Added performance benchmarking and visualization
    • Enhanced command-line interface with better argument descriptions
    • Added API key configuration system
    • Improved error handling and logging
    • Added support for Eschmeyer's Catalog of Fishes
  • Version 1.0.1 (2024-10-15)

    • Added sequence download from BOLD Systems and/or GenBank
    • Added retrieval of threat data from the IUCN Red List
  • Version 1.0.0 (2024-10-01)

    • Initial release of dataFishing

Getting Started


Prerequisites

Before running dataFishing, make sure the following requirements are met:

Python Environment

  • Python version 3.8 or higher
  • pip (Python package installer)
  • conda (optional but recommended)

System Requirements

  • Internet connection for API access
  • Minimum 4 GB RAM (8 GB recommended for large datasets)
  • 1 GB free disk space for results and sequences

Installation


Option 1: Install from PyPI (Recommended)

pip install dataFishing

Option 2: Install from Source

git clone https://github.com/luanrabelo/dataFishing.git
cd dataFishing
pip install -r requirements.txt
pip install -e .

Option 3: Using Conda Environment

conda create -n dataFishing python=3.11
conda activate dataFishing
pip install dataFishing

API Keys Configuration


Some databases require API keys for access. Create an apikeys.env file in your working directory:

# Create apikeys.env file
touch apikeys.env

Add your API keys to the file:

# NCBI Configuration (Required for NCBI database)
NCBI_EMAIL=your-email@university.edu
NCBI_API_KEY=your-ncbi-api-key-here

# IUCN Configuration (Required for IUCN database)
IUCN_API_KEY=your-iucn-api-token-here
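
Before a long run, it can help to confirm that the keys file contains what the selected databases need. The following is a minimal sketch, not dataFishing's own loader: `parse_env`, `missing_keys`, and the `REQUIRED` mapping are hypothetical names, and the assumption that NCBI needs only `NCBI_EMAIL` while IUCN needs `IUCN_API_KEY` is taken from the comments in the file above.

```python
# Hypothetical pre-flight check for an apikeys.env file (not part of dataFishing).

# Assumed mapping of database flag -> variables that must be set.
REQUIRED = {"ncbi": ["NCBI_EMAIL"], "iucn": ["IUCN_API_KEY"]}


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(values, databases):
    """Return the required variables that are absent or empty."""
    return [
        key
        for db in databases
        for key in REQUIRED.get(db, [])
        if not values.get(key)
    ]


sample = "# NCBI Configuration\nNCBI_EMAIL=your-email@university.edu\n"
print(missing_keys(parse_env(sample), ["ncbi", "iucn"]))
```

To check a real file, pass `Path("apikeys.env").read_text()` (from `pathlib`) to `parse_env` instead of the sample string.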

How to Obtain API Keys:

NCBI GenBank:

  1. Register at: https://account.ncbi.nlm.nih.gov/signup/
  2. An email address is required; an API key is optional but raises the rate limit
  3. Get API key at: https://www.ncbi.nlm.nih.gov/account/settings/
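
To make the role of the email and API key concrete, the sketch below builds a request URL for NCBI's public E-utilities `esearch` endpoint, which accepts `email` and `api_key` as query parameters. This is only an illustration of how the credentials reach NCBI; it is an assumption that dataFishing queries NCBI in a similar way, and `esearch_url` is a hypothetical helper. NCBI documents a default limit of 3 requests per second without a key and 10 with one.

```python
# Illustration only: shows where the NCBI email and API key go in a request.
# It does not reproduce dataFishing's own NCBI client.
from urllib.parse import urlencode

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(term, email, api_key=None, db="nuccore"):
    """Build an esearch URL; api_key is added only when provided."""
    params = {"db": db, "term": term, "email": email, "retmode": "json"}
    if api_key:
        params["api_key"] = api_key
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


print(esearch_url("Panthera tigris[Organism]", "your-email@university.edu"))
```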

IUCN Red List:

  1. Request token at: https://api.iucnredlist.org/
  2. Academic use is usually free
  3. Commercial use requires subscription

Other databases (GBIF, WoRMS, BOLD Systems, Eschmeyer's Catalog of Fishes) do not require API keys.

Usage


Basic syntax:

dataFishing --input SPECIES_FILE --output RESULTS_DIR [OPTIONS]

Command Line Arguments


📁 Input and Output Arguments

  • --input, -i PATH (required): Path to species list file (.txt or .tsv)
  • --output, -o PATH (required): Output directory for results

🌍 Biodiversity Databases Arguments

  • --all: Query all available databases
  • --iucn: Query IUCN Red List (requires API key)
  • --ncbi: Query NCBI GenBank (requires email)
  • --bold: Query BOLD Systems
  • --gbif: Query GBIF
  • --worms: Query WoRMS
  • --eschmeyer: Query Eschmeyer's Catalog

🧬 NCBI GenBank Arguments

  • --email, -e EMAIL: Email address for NCBI access (required for NCBI)
  • --ncbi-api-key KEY: NCBI API key for higher rate limits

⬇️ Sequence Download Arguments

  • --download-sequences: Enable sequence download
  • --genes-list FILE: File containing gene names (one per line)

📊 Performance and Logging Arguments

  • --benchmark: Enable performance benchmarking
  • --plot-benchmark TSV_FILE: Generate plots from benchmark data
  • --verbose, -v: Enable detailed logging
  • --log-file: Save logs to files

🔧 API Configuration Arguments

  • --max-concurrent N: Maximum concurrent requests
  • --rate-limit SECONDS: Delay between requests
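
The interaction between a concurrency cap and a per-request delay is easier to see in code. The sketch below shows the general `asyncio` pattern that options like `--max-concurrent` and `--rate-limit` usually control; it is an assumption that dataFishing works this way internally. `fetch_all` is a hypothetical function, and `asyncio.sleep` stands in for a real HTTP request.

```python
# General pattern only: a semaphore caps in-flight requests, and a sleep
# spaces them out. Not taken from dataFishing's source code.
import asyncio


async def fetch_all(names, max_concurrent=5, rate_limit=0.0):
    """Simulate one request per name, with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)
    state = {"active": 0, "peak": 0}

    async def fetch(name):
        async with semaphore:
            state["active"] += 1
            state["peak"] = max(state["peak"], state["active"])
            await asyncio.sleep(0.01)        # stand-in for the HTTP request
            await asyncio.sleep(rate_limit)  # pause before freeing the slot
            state["active"] -= 1
            return name, "ok"

    results = await asyncio.gather(*(fetch(n) for n in names))
    return dict(results), state["peak"]


species = ["Panthera tigris", "Canis lupus", "Ursus americanus"]
results, peak = asyncio.run(fetch_all(species, max_concurrent=2))
print(len(results), "requests completed; peak concurrency:", peak)
```

Lower values of the cap and longer delays are gentler on the remote APIs at the cost of total run time.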

Examples


Basic Usage - All Databases

dataFishing --input species.txt --output results/ --all --email your@email.com

Specific Databases Only

dataFishing --input species.txt --output results/ --iucn --worms --gbif --verbose

Download Sequences from NCBI

dataFishing --input species.txt --output results/ --ncbi \
           --email your@email.com --download-sequences --genes-list genes.txt

Enable Performance Benchmarking

dataFishing --input species.txt --output results/ --all \
           --email your@email.com --benchmark --verbose

Generate Plots from Existing Benchmark

dataFishing --plot-benchmark results/benchmark_results.tsv
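
For scripted or batch runs, the commands above can be assembled programmatically. The sketch below is a hypothetical helper (`build_command` is not part of dataFishing) that uses only flags documented in this README and returns an argument list suitable for `subprocess.run`.

```python
# Hypothetical wrapper that assembles a dataFishing command line.
# Only flags listed under "Command Line Arguments" are used.
import shlex


def build_command(input_path, output_dir, databases=(), email=None,
                  genes_list=None, benchmark=False, verbose=False):
    """Return the argv list for one dataFishing run."""
    cmd = ["dataFishing", "--input", input_path, "--output", output_dir]
    cmd += [f"--{db}" for db in databases]  # e.g. "iucn" -> "--iucn"
    if email:
        cmd += ["--email", email]
    if genes_list:
        cmd += ["--download-sequences", "--genes-list", genes_list]
    if benchmark:
        cmd.append("--benchmark")
    if verbose:
        cmd.append("--verbose")
    return cmd


cmd = build_command("species.txt", "results/",
                    databases=["iucn", "worms", "gbif"], verbose=True)
print(shlex.join(cmd))
# To execute it (requires dataFishing to be installed and on PATH):
# import subprocess; subprocess.run(cmd, check=True)
```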

Input File Formats


Text File (.txt)

Panthera tigris
Canis lupus
Ursus americanus
Ailuropoda melanoleuca
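
Species lists assembled by hand often contain blank lines, duplicates, or miscapitalized names. The sketch below cleans such a list before it is saved as the `.txt` input. It is a hypothetical pre-processing step, not something dataFishing requires, and the binomial pattern (capitalized genus, lowercase epithet) is an assumption; adjust it if your list includes subspecies or higher ranks.

```python
# Hypothetical clean-up of a species list before passing it to --input.
import re

# Assumed shape of a name: "Genus epithet" (epithet may contain hyphens).
BINOMIAL = re.compile(r"^[A-Z][a-z]+ [a-z-]+$")


def clean_species_list(text):
    """Return (valid, rejected) names, de-duplicated in input order."""
    seen, valid, rejected = set(), [], []
    for raw in text.splitlines():
        name = " ".join(raw.split())  # trim and collapse internal whitespace
        if not name or name in seen:
            continue
        seen.add(name)
        if BINOMIAL.match(name):
            valid.append(name)
        else:
            rejected.append(name)
    return valid, rejected


valid, rejected = clean_species_list("Panthera tigris\n\ncanis lupus\nPanthera  tigris\n")
print("valid:", valid)
print("rejected:", rejected)
```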

TSV File from BOLD Systems

Download TSV data from BOLD Systems:

  1. Search for your taxonomic group
  2. Click "Combined: TSV" to download

Gene List File Example

COI
COII
COIII
ND5
CYTB
Control Region
16S
12S

Supported Genes


| Category | Mitochondrial Genes | Chloroplast Genes |
|----------|---------------------|-------------------|
| rRNA | 12S, 16S | - |
| Complex I | ND1, ND2, ND3, ND4, ND4L, ND5, ND6 | - |
| Complex III | CYTB | - |
| Complex IV | COI, COII, COIII | - |
| Complex V | ATP6, ATP8 | - |
| Control Region | Control Region | - |
| ATP Synthase | - | atpA, atpB, atpE, atpF, atpH, atpI |
| Cytochrome | - | petA, petB, petD, petE, petG, petL, petN |
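
A gene list can be checked against this table before starting a download. The sketch below is a hypothetical validator, not part of dataFishing: the gene names are copied from the table as shown here, so any genes the tool supports beyond it will be reported as unknown, and the case-insensitive matching is an assumption about what is convenient rather than documented behavior.

```python
# Hypothetical check of a genes file against the table above.
MITOCHONDRIAL = {
    "12S", "16S", "ND1", "ND2", "ND3", "ND4", "ND4L", "ND5", "ND6",
    "CYTB", "COI", "COII", "COIII", "ATP6", "ATP8", "Control Region",
}
CHLOROPLAST = {
    "atpA", "atpB", "atpE", "atpF", "atpH", "atpI",
    "petA", "petB", "petD", "petE", "petG", "petL", "petN",
}
# Lowercase lookup -> spelling used in the table.
SUPPORTED = {gene.lower(): gene for gene in MITOCHONDRIAL | CHLOROPLAST}


def check_genes(lines):
    """Split gene names into (known, unknown); known names use table spelling."""
    known, unknown = [], []
    for line in lines:
        name = line.strip()
        if not name:
            continue
        canonical = SUPPORTED.get(name.lower())
        if canonical:
            known.append(canonical)
        else:
            unknown.append(name)
    return known, unknown


known, unknown = check_genes(["COI", "cytb", "Control Region", "rbcL"])
print("known:", known)
print("unknown:", unknown)
```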
