dataFishing

dataFishing is a Python tool that automates searches across genomic and biodiversity databases for biodiversity research. It streamlines the retrieval of DNA sequences, common names, synonyms, conservation status, and species occurrence data, and benchmarks faster than comparable R packages.
Contents Overview
- System Overview
- How to cite dataFishing
- License
- The Hitchhiker's Guide to dataFishing
- dataFishing Development Team
- Contact
System Overview
:rocket: Go to Contents Overview
<p align="center"> <img src="https://raw.githubusercontent.com/luanrabelo/dataFishing/stable/docs/assets/dataFishing.png" alt="dataFishing Logo" width="15%"> </p>

dataFishing is an efficient Python tool and user-friendly web form for mining mitochondrial/chloroplast sequences and biodiversity data. It is designed to facilitate and automate access to information from several databases: NCBI GenBank, BOLD Systems, GBIF, WoRMS, the IUCN Red List, and Eschmeyer's Catalog of Fishes. dataFishing retrieves taxonomic information from these databases faster than comparable tools, and also retrieves DNA sequences, common names, synonyms, conservation status, and occurrence points for species. The dataFishing repository, hosted on GitHub and licensed under MIT, is a freely accessible resource for the scientific community.
Key Features
🌍 Multiple Database Support: Access 6 major biodiversity databases
🧬 Sequence Download: Automated download of mitochondrial and chloroplast sequences
📊 Performance Benchmarking: Built-in performance analysis and visualization
⚡ Asynchronous Processing: High-speed concurrent API requests
📋 Comprehensive Results: Excel, CSV, and TSV output formats
🔧 Easy Configuration: Simple command-line interface with helpful documentation
How to cite dataFishing
:rocket: Go to Contents Overview
When referencing the dataFishing tool, please cite it appropriately in your academic or professional work:
Rabelo, L., Sodré, D., Balcázar, O. D. A., do Rosário, M. F., Guimarães-Costa, A. J., Gomes, G., Sampaio, I., & Vallinoto, M. (2025). dataFishing: An efficient Python tool and user-friendly web-form for mining mitochondrial and chloroplast sequences, taxonomic, and biodiversity data. Ecological Informatics, 85, 102970. https://doi.org/10.1016/j.ecoinf.2024.102970
License
dataFishing is released under the MIT License. This license permits reuse within proprietary software provided that all copies of the licensed software include a copy of the MIT License terms and the copyright notice.
For more details, please see the MIT License.
The Hitchhiker's Guide to dataFishing
Change Log
:rocket: Go to Contents Overview
Version 1.6.1 (2025-01-30)
- Added asynchronous processing with aiohttp for improved performance
- Implemented comprehensive IUCN Red List data extraction
- Added performance benchmarking and visualization
- Enhanced command-line interface with better argument descriptions
- Added API key configuration system
- Improved error handling and logging
- Added support for Eschmeyer's Catalog of Fishes
Version 1.0.1 (2024-10-15)
- Added the ability to download sequence data from BOLD System and/or GenBank
- Added the ability to obtain threat data from the IUCN Red List database
Version 1.0.0 (2024-10-01)
- Initial release of dataFishing
Getting Started
:rocket: Go to Contents Overview
Prerequisites
Before you run dataFishing, make sure you have the following prerequisites installed:
Python Environment
- Python version 3.8 or higher
- pip (Python package installer)
- conda (optional but recommended)
System Requirements
- Internet connection for API access
- Minimum 4GB RAM (8GB recommended for large datasets)
- 1GB free disk space for results and sequences
Installation
:rocket: Go to Contents Overview
Option 1: Install from PyPI (Recommended)
```bash
pip install dataFishing
```
Option 2: Install from Source
```bash
git clone https://github.com/luanrabelo/dataFishing.git
cd dataFishing
pip install -r requirements.txt
pip install -e .
```
Option 3: Using Conda Environment
```bash
conda create -n dataFishing python=3.11
conda activate dataFishing
pip install dataFishing
```
API Keys Configuration
:rocket: Go to Contents Overview
Some databases require API keys for access. Create an apikeys.env file in your working directory:
```bash
# Create apikeys.env file
touch apikeys.env
```
Add your API keys to the file:
```env
# NCBI Configuration (required for the NCBI database)
NCBI_EMAIL=your-email@university.edu
NCBI_API_KEY=your-ncbi-api-key-here

# IUCN Configuration (required for the IUCN database)
IUCN_API_KEY=your-iucn-api-token-here
```
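dataFishing reads this file itself; the sketch below is only an illustration of how such simple `KEY=VALUE` lines map onto environment variables, assuming blank lines and `#` comments are ignored (it is not dataFishing's internal loader).

```python
import os

def load_env_file(path: str) -> dict:
    """Parse simple KEY=VALUE lines; comments and blanks are ignored."""
    keys = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            keys[key.strip()] = value.strip()
    os.environ.update(keys)  # expose the keys to the current process
    return keys
```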
How to Obtain API Keys:
NCBI GenBank:
- Register at: https://account.ncbi.nlm.nih.gov/signup/
- Email is required; an API key is optional but raises the request limit from 3 to 10 requests per second
- Get API key at: https://www.ncbi.nlm.nih.gov/account/settings/
IUCN Red List:
- Request token at: https://api.iucnredlist.org/
- Academic use is usually free
- Commercial use requires subscription
The other databases (GBIF, WoRMS, BOLD, and Eschmeyer's Catalog) do not require API keys.
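For example, GBIF's public name-matching endpoint can be queried with nothing but a species name. The helper below is an illustrative sketch (not dataFishing's internal code) that builds such a keyless request URL against GBIF's `/v1/species/match` API:

```python
from urllib.parse import urlencode

GBIF_MATCH = "https://api.gbif.org/v1/species/match"

def gbif_match_url(species: str) -> str:
    """Build a keyless GBIF name-matching request for one species."""
    return f"{GBIF_MATCH}?{urlencode({'name': species})}"
```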
Usage
:rocket: Go to Contents Overview
Basic syntax:
```bash
dataFishing --input SPECIES_FILE --output RESULTS_DIR [OPTIONS]
```
Command Line Arguments
:rocket: Go to Contents Overview
📁 Input and Output Arguments
- `--input, -i PATH` (required): Path to species list file (`.txt` or `.tsv`)
- `--output, -o PATH` (required): Output directory for results
🌍 Biodiversity Databases Arguments
- `--all`: Query all available databases
- `--iucn`: Query IUCN Red List (requires API key)
- `--ncbi`: Query NCBI GenBank (requires email)
- `--bold`: Query BOLD Systems
- `--gbif`: Query GBIF
- `--worms`: Query WoRMS
- `--eschmeyer`: Query Eschmeyer's Catalog of Fishes
🧬 NCBI GenBank Arguments
- `--email, -e EMAIL`: Email address for NCBI access (required for NCBI)
- `--ncbi-api-key KEY`: NCBI API key for higher rate limits
⬇️ Sequence Download Arguments
- `--download-sequences`: Enable sequence download
- `--genes-list FILE`: File containing gene names (one per line)
📊 Performance and Logging Arguments
- `--benchmark`: Enable performance benchmarking
- `--plot-benchmark TSV_FILE`: Generate plots from benchmark data
- `--verbose, -v`: Enable detailed logging
- `--log-file`: Save logs to files
🔧 API Configuration Arguments
- `--max-concurrent N`: Maximum number of concurrent requests
- `--rate-limit SECONDS`: Delay between requests
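The semantics of `--max-concurrent` and `--rate-limit` can be sketched with stdlib `asyncio` alone: a semaphore caps how many queries are in flight, and a sleep paces each request. This is a minimal illustration with a dummy worker, not dataFishing's actual aiohttp pipeline:

```python
import asyncio

async def fetch_all(species, worker, max_concurrent=5, rate_limit=0.2):
    """Query species concurrently, capping in-flight requests and pacing them."""
    sem = asyncio.Semaphore(max_concurrent)

    async def one(name):
        async with sem:                      # at most max_concurrent at once
            await asyncio.sleep(rate_limit)  # polite delay between requests
            return await worker(name)

    return await asyncio.gather(*(one(n) for n in species))
```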
Examples
:rocket: Go to Contents Overview
Basic Usage - All Databases
```bash
dataFishing --input species.txt --output results/ --all --email your@email.com
```
Specific Databases Only
```bash
dataFishing --input species.txt --output results/ --iucn --worms --gbif --verbose
```
Download Sequences from NCBI
```bash
dataFishing --input species.txt --output results/ --ncbi \
    --email your@email.com --download-sequences --genes-list genes.txt
```
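Under the hood, queries like this go through NCBI's public E-utilities. As an illustration only (not dataFishing's internal implementation), an `esearch` URL for one species/gene pair could be built like so; the term syntax `[Organism]`/`[Gene]` is NCBI's standard field-tag notation:

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def ncbi_search_url(species, gene, email, api_key=None):
    """Build an E-utilities esearch query for one species/gene pair."""
    params = {
        "db": "nucleotide",
        "term": f"{species}[Organism] AND {gene}[Gene]",
        "email": email,
    }
    if api_key:
        params["api_key"] = api_key  # optional: raises NCBI's rate limit
    return f"{ESEARCH}?{urlencode(params)}"
```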
Enable Performance Benchmarking
```bash
dataFishing --input species.txt --output results/ --all \
    --email your@email.com --benchmark --verbose
```
Generate Plots from Existing Benchmark
```bash
dataFishing --plot-benchmark results/benchmark_results.tsv
```
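If you want to inspect benchmark output without plotting, the TSV can be summarized with the standard library. The column names `database` and `elapsed_seconds` below are assumptions for illustration; check the header of your actual `benchmark_results.tsv` and adjust:

```python
import csv
import statistics
from collections import defaultdict

def summarize_benchmark(tsv_path):
    """Mean elapsed time per database from a benchmark TSV (assumed columns)."""
    times = defaultdict(list)
    with open(tsv_path, newline="") as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            times[row["database"]].append(float(row["elapsed_seconds"]))
    return {db: statistics.mean(v) for db, v in times.items()}
```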
Input File Formats
:rocket: Go to Contents Overview
Text File (.txt)
```text
Panthera tigris
Canis lupus
Ursus americanus
Ailuropoda melanoleuca
```
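Before submitting a long run, it can help to sanity-check that each line looks like a Latin binomial (capitalized genus, lowercase epithet). This small checker is a suggestion, not part of dataFishing:

```python
import re

# Latin binomial: capitalized genus + lowercase epithet, e.g. "Canis lupus"
BINOMIAL = re.compile(r"^[A-Z][a-z]+ [a-z]+$")

def check_species_list(lines):
    """Return the names that do not look like Latin binomials."""
    return [ln.strip() for ln in lines
            if ln.strip() and not BINOMIAL.match(ln.strip())]
```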
TSV File from BOLD Systems
Download TSV data from BOLD Systems:
- Search for your taxonomic group
- Click "Combined: TSV" to download
Gene List File Example
```text
COI
COII
COIII
ND5
CYTB
Control Region
16S
12S
```
Supported Genes
:rocket: Go to Contents Overview
| Category | Mitochondrial Genes | Chloroplast Genes |
|----------|---------------------|-------------------|
| rRNA | 12S, 16S | - |
| Complex I | ND1, ND2, ND3, ND4, ND4L, ND5, ND6 | - |
| Complex III | CYTB | - |
| Complex IV | COI, COII, COIII | - |
| Complex V | ATP6, ATP8 | - |
| Control Region | Control Region | - |
| ATP Synthase | - | atpA, atpB, atpE, atpF, atpH, atpI |
| Cytochrome | - | petA, petB, petD, petE, petG, petL, petN |
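When building a `--genes-list` file, a quick membership check against this table can catch typos before a run. The dictionary below is transcribed from the table above; it is an illustrative helper, not dataFishing's internal data structure:

```python
# Supported genes, grouped by genome (transcribed from the table above)
SUPPORTED_GENES = {
    "mitochondrial": {"12S", "16S", "ND1", "ND2", "ND3", "ND4", "ND4L",
                      "ND5", "ND6", "CYTB", "COI", "COII", "COIII",
                      "ATP6", "ATP8", "Control Region"},
    "chloroplast": {"atpA", "atpB", "atpE", "atpF", "atpH", "atpI",
                    "petA", "petB", "petD", "petE", "petG", "petL", "petN"},
}

def genome_of(gene):
    """Return which genome a gene belongs to, or None if unsupported."""
    for genome, genes in SUPPORTED_GENES.items():
        if gene in genes:
            return genome
    return None
```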
