dataFishing

dataFishing is a Python tool that automates searches across genomic and biodiversity databases for biodiversity research. It streamlines the retrieval of DNA sequences, common names, synonyms, conservation status, and species occurrence data, and benchmarks faster than comparable R packages.
Contents Overview
- System Overview
- How to cite dataFishing
- License
- The Hitchhiker's Guide to dataFishing
- dataFishing Development Team
- Contact
System Overview
:rocket: Go to Contents Overview
<p align="center"> <img src="https://raw.githubusercontent.com/luanrabelo/dataFishing/stable/docs/assets/dataFishing.png" alt="dataFishing Logo" width="15%"> </p>

dataFishing is an efficient Python tool and user-friendly web form for mining mitochondrial/chloroplast sequences and biodiversity data. It is designed to facilitate and automate access to information from several databases: NCBI GenBank, BOLD Systems, GBIF, WoRMS, the IUCN Red List, and Eschmeyer's Catalog of Fishes. dataFishing retrieves taxonomic information from these databases faster than comparable tools, and also retrieves DNA sequences, common names, synonyms, conservation status, and occurrence points for species. The dataFishing repository, hosted on GitHub and licensed under MIT, is a freely accessible resource for the scientific community.
Key Features
🌍 Multiple Database Support: Access 6 major biodiversity databases
🧬 Sequence Download: Automated download of mitochondrial and chloroplast sequences
📊 Performance Benchmarking: Built-in performance analysis and visualization
⚡ Asynchronous Processing: High-speed concurrent API requests
📋 Comprehensive Results: Excel, CSV, and TSV output formats
🔧 Easy Configuration: Simple command-line interface with helpful documentation
How to cite dataFishing
:rocket: Go to Contents Overview
When referencing the dataFishing tool, please cite it appropriately in your academic or professional work:
Rabelo, L., Sodré, D., Balcázar, O. D. A., do Rosário, M. F., Guimarães-Costa, A. J., Gomes, G., Sampaio, I., & Vallinoto, M. (2025). dataFishing: An efficient Python tool and user-friendly web-form for mining mitochondrial and chloroplast sequences, taxonomic, and biodiversity data. Ecological Informatics, 85, 102970. https://doi.org/10.1016/j.ecoinf.2024.102970
License
dataFishing is released under the MIT License. This license permits reuse within proprietary software provided that all copies of the licensed software include a copy of the MIT License terms and the copyright notice.
For more details, please see the MIT License.
The Hitchhiker's Guide to dataFishing
Change Log
:rocket: Go to Contents Overview
Version 1.6.1 (2025-01-30)
- Added asynchronous processing with aiohttp for improved performance
- Implemented comprehensive IUCN Red List data extraction
- Added performance benchmarking and visualization
- Enhanced command-line interface with better argument descriptions
- Added API key configuration system
- Improved error handling and logging
- Added support for Eschmeyer's Catalog of Fishes
Version 1.0.1 (2024-10-15)
- Added the ability to download sequence data from BOLD System and/or GenBank
- Added the ability to obtain threat data from the IUCN Red List database
Version 1.0.0 (2024-10-01)
- Initial release of dataFishing
Getting Started
:rocket: Go to Contents Overview
Prerequisites
Before you run dataFishing, make sure you have the following prerequisites installed:
Python Environment
- Python version 3.8 or higher
- pip (Python package installer)
- conda (optional but recommended)
System Requirements
- Internet connection for API access
- Minimum 4GB RAM (8GB recommended for large datasets)
- 1GB free disk space for results and sequences
Installation
:rocket: Go to Contents Overview
Option 1: Install from PyPI (Recommended)
```bash
pip install dataFishing
```
Option 2: Install from Source
```bash
git clone https://github.com/luanrabelo/dataFishing.git
cd dataFishing
pip install -r requirements.txt
pip install -e .
```
Option 3: Using Conda Environment
```bash
conda create -n dataFishing python=3.11
conda activate dataFishing
pip install dataFishing
```
API Keys Configuration
:rocket: Go to Contents Overview
Some databases require API keys for access. Create an apikeys.env file in your working directory:
```bash
# Create apikeys.env file
touch apikeys.env
```
Add your API keys to the file:
```env
# NCBI Configuration (required for the NCBI database)
NCBI_EMAIL=your-email@university.edu
NCBI_API_KEY=your-ncbi-api-key-here

# IUCN Configuration (required for the IUCN database)
IUCN_API_KEY=your-iucn-api-token-here
```
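dataFishing reads this file itself; the sketch below is only an illustration of how such simple `KEY=VALUE` lines map onto environment variables, assuming blank lines and `#` comments are ignored (it is not dataFishing's internal loader).

```python
import os

def load_env_file(path: str) -> dict:
    """Parse simple KEY=VALUE lines; comments and blanks are ignored."""
    keys = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            keys[key.strip()] = value.strip()
    os.environ.update(keys)  # expose the keys to the current process
    return keys
```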
How to Obtain API Keys:
NCBI GenBank:
- Register at: https://account.ncbi.nlm.nih.gov/signup/
- Email is required; an API key is optional but raises the request limit from 3 to 10 requests per second
- Get API key at: https://www.ncbi.nlm.nih.gov/account/settings/
IUCN Red List:
- Request token at: https://api.iucnredlist.org/
- Academic use is usually free
- Commercial use requires subscription
The other databases (GBIF, WoRMS, BOLD, and Eschmeyer's Catalog) do not require API keys.
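For example, GBIF's public name-matching endpoint can be queried with nothing but a species name. The helper below is an illustrative sketch (not dataFishing's internal code) that builds such a keyless request URL against GBIF's `/v1/species/match` API:

```python
from urllib.parse import urlencode

GBIF_MATCH = "https://api.gbif.org/v1/species/match"

def gbif_match_url(species: str) -> str:
    """Build a keyless GBIF name-matching request for one species."""
    return f"{GBIF_MATCH}?{urlencode({'name': species})}"
```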
Usage
:rocket: Go to Contents Overview
Basic syntax:
```bash
dataFishing --input SPECIES_FILE --output RESULTS_DIR [OPTIONS]
```
Command Line Arguments
:rocket: Go to Contents Overview
📁 Input and Output Arguments
- `--input, -i PATH` (required): Path to species list file (`.txt` or `.tsv`)
- `--output, -o PATH` (required): Output directory for results
🌍 Biodiversity Databases Arguments
- `--all`: Query all available databases
- `--iucn`: Query IUCN Red List (requires API key)
- `--ncbi`: Query NCBI GenBank (requires email)
- `--bold`: Query BOLD Systems
- `--gbif`: Query GBIF
- `--worms`: Query WoRMS
- `--eschmeyer`: Query Eschmeyer's Catalog of Fishes
🧬 NCBI GenBank Arguments
- `--email, -e EMAIL`: Email address for NCBI access (required for NCBI)
- `--ncbi-api-key KEY`: NCBI API key for higher rate limits
⬇️ Sequence Download Arguments
- `--download-sequences`: Enable sequence download
- `--genes-list FILE`: File containing gene names (one per line)
📊 Performance and Logging Arguments
- `--benchmark`: Enable performance benchmarking
- `--plot-benchmark TSV_FILE`: Generate plots from benchmark data
- `--verbose, -v`: Enable detailed logging
- `--log-file`: Save logs to files
🔧 API Configuration Arguments
- `--max-concurrent N`: Maximum number of concurrent requests
- `--rate-limit SECONDS`: Delay between requests
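The semantics of `--max-concurrent` and `--rate-limit` can be sketched with stdlib `asyncio` alone: a semaphore caps how many queries are in flight, and a sleep paces each request. This is a minimal illustration with a dummy worker, not dataFishing's actual aiohttp pipeline:

```python
import asyncio

async def fetch_all(species, worker, max_concurrent=5, rate_limit=0.2):
    """Query species concurrently, capping in-flight requests and pacing them."""
    sem = asyncio.Semaphore(max_concurrent)

    async def one(name):
        async with sem:                      # at most max_concurrent at once
            await asyncio.sleep(rate_limit)  # polite delay between requests
            return await worker(name)

    return await asyncio.gather(*(one(n) for n in species))
```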
Examples
:rocket: Go to Contents Overview
Basic Usage - All Databases
```bash
dataFishing --input species.txt --output results/ --all --email your@email.com
```
Specific Databases Only
```bash
dataFishing --input species.txt --output results/ --iucn --worms --gbif --verbose
```
Download Sequences from NCBI
```bash
dataFishing --input species.txt --output results/ --ncbi \
    --email your@email.com --download-sequences --genes-list genes.txt
```
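Under the hood, queries like this go through NCBI's public E-utilities. As an illustration only (not dataFishing's internal implementation), an `esearch` URL for one species/gene pair could be built like so; the term syntax `[Organism]`/`[Gene]` is NCBI's standard field-tag notation:

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def ncbi_search_url(species, gene, email, api_key=None):
    """Build an E-utilities esearch query for one species/gene pair."""
    params = {
        "db": "nucleotide",
        "term": f"{species}[Organism] AND {gene}[Gene]",
        "email": email,
    }
    if api_key:
        params["api_key"] = api_key  # optional: raises NCBI's rate limit
    return f"{ESEARCH}?{urlencode(params)}"
```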
Enable Performance Benchmarking
```bash
dataFishing --input species.txt --output results/ --all \
    --email your@email.com --benchmark --verbose
```
Generate Plots from Existing Benchmark
```bash
dataFishing --plot-benchmark results/benchmark_results.tsv
```
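If you want to inspect benchmark output without plotting, the TSV can be summarized with the standard library. The column names `database` and `elapsed_seconds` below are assumptions for illustration; check the header of your actual `benchmark_results.tsv` and adjust:

```python
import csv
import statistics
from collections import defaultdict

def summarize_benchmark(tsv_path):
    """Mean elapsed time per database from a benchmark TSV (assumed columns)."""
    times = defaultdict(list)
    with open(tsv_path, newline="") as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            times[row["database"]].append(float(row["elapsed_seconds"]))
    return {db: statistics.mean(v) for db, v in times.items()}
```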
Input File Formats
:rocket: Go to Contents Overview
Text File (.txt)
```text
Panthera tigris
Canis lupus
Ursus americanus
Ailuropoda melanoleuca
```
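Before submitting a long run, it can help to sanity-check that each line looks like a Latin binomial (capitalized genus, lowercase epithet). This small checker is a suggestion, not part of dataFishing:

```python
import re

# Latin binomial: capitalized genus + lowercase epithet, e.g. "Canis lupus"
BINOMIAL = re.compile(r"^[A-Z][a-z]+ [a-z]+$")

def check_species_list(lines):
    """Return the names that do not look like Latin binomials."""
    return [ln.strip() for ln in lines
            if ln.strip() and not BINOMIAL.match(ln.strip())]
```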
TSV File from BOLD Systems
Download TSV data from BOLD Systems:
- Search for your taxonomic group
- Click "Combined: TSV" to download
Gene List File Example
```text
COI
COII
COIII
ND5
CYTB
Control Region
16S
12S
```
Supported Genes
:rocket: Go to Contents Overview
| Category | Mitochondrial Genes | Chloroplast Genes |
|----------|---------------------|-------------------|
| rRNA | 12S, 16S | - |
| Complex I | ND1, ND2, ND3, ND4, ND4L, ND5, ND6 | - |
| Complex III | CYTB | - |
| Complex IV | COI, COII, COIII | - |
| Complex V | ATP6, ATP8 | - |
| Control Region | Control Region | - |
| ATP Synthase | - | atpA, atpB, atpE, atpF, atpH, atpI |
| Cytochrome | - | petA, petB, petD, petE, petG, petL, petN |
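When building a `--genes-list` file, a quick membership check against this table can catch typos before a run. The dictionary below is transcribed from the table above; it is an illustrative helper, not dataFishing's internal data structure:

```python
# Supported genes, grouped by genome (transcribed from the table above)
SUPPORTED_GENES = {
    "mitochondrial": {"12S", "16S", "ND1", "ND2", "ND3", "ND4", "ND4L",
                      "ND5", "ND6", "CYTB", "COI", "COII", "COIII",
                      "ATP6", "ATP8", "Control Region"},
    "chloroplast": {"atpA", "atpB", "atpE", "atpF", "atpH", "atpI",
                    "petA", "petB", "petD", "petE", "petG", "petL", "petN"},
}

def genome_of(gene):
    """Return which genome a gene belongs to, or None if unsupported."""
    for genome, genes in SUPPORTED_GENES.items():
        if gene in genes:
            return genome
    return None
```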
