DNASequenceAnalysisTool
A comprehensive Python tool for DNA sequence analysis that provides various molecular biology and bioinformatics functions.
🧬 DNA Sequence Analysis Tool
A high-performance Python library and command-line tool for comprehensive DNA/RNA sequence analysis with advanced visualization capabilities. This toolkit is designed for both bioinformaticians and molecular biologists, providing a robust set of tools for sequence analysis, manipulation, and visualization.
📑 Table of Contents
- ✨ Key Features
- 🏗️ Project Structure
- 📋 Requirements
- 🚀 Installation
- ⚡ Quick Start
- 📚 Usage Examples
- ⚙️ Configuration
- 📖 Documentation
- 🤝 Contributing
- 🧪 Testing
- 📄 License
- 📜 Changelog
✨ Key Features
Sequence Analysis
- GC content calculation
- Melting temperature prediction
- Molecular weight calculation
- Sequence validation and sanitization
- Motif finding and pattern matching
- ORF (Open Reading Frame) detection
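To illustrate what ORF detection involves, here is a minimal, self-contained sketch that scans the three forward reading frames for an ATG followed by an in-frame stop codon. The name `find_orfs_sketch` and its `min_length` parameter are hypothetical and are not taken from this package's API; the package's own detection may differ (for example, by also scanning the reverse strand).

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs_sketch(dna: str, min_length: int = 6) -> List[Tuple[int, int, str]]:
    """Return (start, end, orf) tuples for the three forward reading frames.

    Hypothetical helper for illustration only. Positions are 0-based,
    `end` is exclusive, and each ORF includes its stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # index of the currently open ATG, if any
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start >= min_length:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None  # close this ORF and look for the next ATG
    return orfs


print(find_orfs_sketch("ATGAAATAG"))
```

Only the first ATG in a frame opens an ORF here, so nested start codons are ignored; that is a simplification chosen for brevity.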
Sequence Manipulation
- Reverse complement generation
- Transcription and translation
- Sequence alignment
- Primer design
- Restriction site analysis
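As a rough picture of transcription and translation, the sketch below builds the standard genetic code table and applies it codon by codon. The function names `transcribe` and `translate` are assumptions made for this example and are not confirmed to match the package's API.

```python
from itertools import product

# Standard genetic code, listed in TCAG order (NCBI translation table 1).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna: str) -> str:
    """Return the RNA equivalent of a coding-strand DNA sequence (T -> U)."""
    return dna.upper().replace("T", "U")


def translate(dna: str, to_stop: bool = True) -> str:
    """Translate DNA in frame 0; stop codons appear as '*' unless to_stop is set.

    Raises KeyError for codons containing characters other than A, C, G, T.
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):  # a trailing partial codon is ignored
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*" and to_stop:
            break
        protein.append(amino_acid)
    return "".join(protein)


print(transcribe("ATGGCC"))
print(translate("ATGGCCATTGTAATGGGCCGCTGA"))
```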
File I/O Support
- FASTA/FASTQ format support
- GZIP/BZ2 compression support
- Batch processing of multiple files
- Stream processing for large files
- Configurable output formats
- Parallel processing options
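Stream processing of compressed FASTA files can be pictured with the standard library alone. The following is a sketch under assumptions: the helper names `open_maybe_compressed` and `iter_fasta` are hypothetical, and compression is detected from the file suffix only.

```python
import bz2
import gzip
from pathlib import Path
from typing import Iterator, Tuple


def open_maybe_compressed(path):
    """Open a text file, handling .gz and .bz2 transparently by suffix."""
    suffix = Path(path).suffix.lower()
    if suffix == ".gz":
        return gzip.open(path, "rt")
    if suffix == ".bz2":
        return bz2.open(path, "rt")
    return open(path, "r")


def iter_fasta(path) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs one record at a time.

    Only one record is held in memory at once, which is what makes this
    approach suitable for large files.
    """
    header, chunks = None, []
    with open_maybe_compressed(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)
```

A caller would iterate with `for header, seq in iter_fasta("reads.fasta.gz"): ...`, where the file name is a placeholder.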
Visualization
- GC content plots
- Sequence logos
- Restriction maps
- Interactive sequence viewers
Command-Line Interface
- User-friendly command-line tools
- Batch processing support
- Configurable output formats
- Parallel processing options
🏗️ Project Structure
DNASequenceAnalysisTool/
├── dna_sequence_analysis_tool/ # Main package
│ ├── core/ # Core functionality
│ │ ├── __init__.py
│ │ ├── sequence_analysis.py # Sequence analysis functions
│ │ ├── sequence_io.py # File I/O operations
│ │ ├── sequence_validation.py # Sequence validation
│ │ ├── sequence_statistics.py # Statistical analysis
│ │ ├── sequence_transformation.py # Sequence manipulation
│ │ └── visualization.py # Visualization tools
│ ├── data/ # Sample data
│ │ ├── __init__.py
│ │ └── sample_sequence.py # Sample sequences
│ ├── tests/ # Test suite
│ │ ├── __init__.py
│ │ └── test_sequence_analysis.py
│ ├── utils/ # Utility functions
│ │ ├── __init__.py
│ │ └── file_io.py
│ ├── __init__.py
│ ├── cli.py # Command-line interface
│ ├── config.py # Configuration settings
│ ├── exceptions.py # Custom exceptions
│ └── logging_config.py # Logging configuration
├── examples/ # Example scripts
│ ├── basic_sequence_analysis.py
│ ├── file_io_and_visualization.py
│ └── README.md
├── .gitignore
├── CHANGELOG.md
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE
├── MANIFEST.in
├── Makefile
├── pyproject.toml
├── requirements-dev.txt
├── requirements.txt
└── setup.py
📋 Requirements
- Python 3.8+
- Dependencies are listed in requirements.txt
Core Dependencies
- NumPy >= 1.19.0
- SciPy >= 1.5.0
- Biopython >= 1.78
- pandas >= 1.2.0
- pydantic >= 1.8.0
- pyyaml >= 5.4.1
- click >= 8.0.0
- rich >= 10.0.0
- matplotlib >= 3.3.0
- plotly >= 5.0.0
💻 Installation
From PyPI (recommended)
pip install dna-sequence-analysis-tool
From Source
- Clone the repository:
git clone https://github.com/YanCotta/DNASequenceAnalysisTool.git
cd DNASequenceAnalysisTool
- Install with pip in development mode:
pip install -e .
Development Setup
- Install development dependencies:
pip install -r requirements-dev.txt
- Set up pre-commit hooks:
pre-commit install
🚀 Quick Start
Python API
from dna_sequence_analysis_tool import DNASequence, DNAToolkit
# Create a DNA sequence
sequence = DNASequence("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", "example_sequence")
# Get sequence information
print(f"Sequence ID: {sequence.id}")
print(f"Length: {sequence.length} bp")
print(f"GC content: {sequence.gc_content:.2f}%")
# Get reverse complement
rev_comp = sequence.reverse_complement()
print(f"Reverse complement: {rev_comp}")
# Find motifs
motif = "GGC"
positions = sequence.find_motif(motif)
print(f"Motif '{motif}' found at positions: {positions}")
# Analyze with toolkit
toolkit = DNAToolkit()
tm = toolkit.calculate_melting_temperature(sequence.sequence)
print(f"Melting temperature: {tm:.2f}°C")
Command Line Interface
# Analyze a sequence file
dnatool analyze sequences.fasta --output results.csv
# Generate a GC content plot
dnatool plot-gc sequences.fasta --output gc_plot.png
# Find ORFs in a sequence
dnatool find-orfs sequence.fasta --min-length 100
# Get help
dnatool --help
🔧 Configuration
The tool can be configured using a YAML configuration file located at ~/.dna_sequence_analysis/config.yaml.
Example configuration:
# General settings
log_level: INFO
max_sequence_length: 10000000
# File I/O settings
default_input_format: fasta
default_output_format: fasta
auto_detect_format: true
# Performance settings
chunk_size: 10000
max_workers: 4
# Visualization settings
plot_theme: default
default_figure_size: [10, 6]
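One way to think about these settings in code is as a typed container whose defaults are overridden by whatever the YAML file supplies. The sketch below is an assumption-laden illustration, not the package's `config.py`: the `Settings` class and `apply_overrides` function are hypothetical, the default values are simply copied from the example configuration, and the plain dict stands in for the result of parsing the YAML file.

```python
from dataclasses import dataclass, fields, replace
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class Settings:
    """Hypothetical container mirroring the keys in the example configuration."""
    log_level: str = "INFO"
    max_sequence_length: int = 10_000_000
    default_input_format: str = "fasta"
    default_output_format: str = "fasta"
    auto_detect_format: bool = True
    chunk_size: int = 10_000
    max_workers: int = 4
    plot_theme: str = "default"
    default_figure_size: Tuple[int, int] = (10, 6)


def apply_overrides(base: Settings, overrides: Dict[str, Any]) -> Settings:
    """Return a copy of `base` with user-supplied values applied and checked."""
    known = {f.name for f in fields(Settings)}
    unknown = set(overrides) - known
    if unknown:
        raise KeyError(f"Unknown configuration keys: {sorted(unknown)}")
    if "default_figure_size" in overrides:
        # YAML sequences load as lists; store them as an immutable tuple.
        size = tuple(overrides["default_figure_size"])
        overrides = {**overrides, "default_figure_size": size}
    settings = replace(base, **overrides)
    if settings.chunk_size <= 0 or settings.max_workers <= 0:
        raise ValueError("chunk_size and max_workers must be positive")
    return settings


# Stand-in for the parsed contents of a user's config file.
user_values = {"max_workers": 8, "default_figure_size": [12, 4]}
print(apply_overrides(Settings(), user_values))
```

Rejecting unknown keys early is a design choice made here so that a typo in the file surfaces as an error rather than being silently ignored.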
📚 Documentation
Comprehensive documentation is available at Read the Docs.
To build the documentation locally:
cd docs
make html
🤝 Contributing
Contributions are welcome! Please see our Contributing Guide for details on how to contribute to this project.
🧪 Testing
Run the test suite with:
pytest
To generate a test coverage report:
pytest --cov=dna_sequence_analysis_tool --cov-report=term-missing
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
📝 Changelog
See CHANGELOG.md for a history of changes to this project.
📬 Contact & Support
For support or questions, please open an issue on GitHub.
Made with ❤️ by the DNA Sequence Analysis Tool contributors
🌟 Features
Sequence Analysis
- GC content calculation
- Melting temperature prediction
- ORF detection and analysis
- Nucleotide composition analysis
- Pattern recognition and motif finding
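Nucleotide composition and motif finding are both simple enough to sketch with the standard library. The functions below are illustrative only; the names `nucleotide_composition` and `find_motif_positions` are hypothetical, and the 0-based, overlap-aware position convention is an assumption rather than documented package behavior.

```python
from collections import Counter
from typing import Dict, List


def nucleotide_composition(dna: str) -> Dict[str, float]:
    """Return the fraction of each of A, C, G, T in the sequence."""
    dna = dna.upper()
    if not dna:
        raise ValueError("Sequence is empty")
    counts = Counter(dna)
    return {base: counts[base] / len(dna) for base in "ACGT"}


def find_motif_positions(dna: str, motif: str) -> List[int]:
    """Return 0-based start positions of every (possibly overlapping) match."""
    if not motif:
        raise ValueError("Motif is empty")
    dna, motif = dna.upper(), motif.upper()
    positions = []
    start = dna.find(motif)
    while start != -1:
        positions.append(start)
        start = dna.find(motif, start + 1)  # advance by one to allow overlaps
    return positions


print(nucleotide_composition("AACG"))
print(find_motif_positions("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", "GGC"))
```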
Molecular Biology Tools
- DNA/RNA transcription
- Codon-optimized protein translation
- Sophisticated ORF detection
- Advanced melting temperature calculations
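For orientation, two widely used textbook approximations for melting temperature are the Wallace rule for short oligonucleotides, Tm = 2(A+T) + 4(G+C), and the GC-based formula Tm = 64.9 + 41(G+C-16.4)/N for longer sequences. The sketch below implements only these approximations; it is not claimed to be the method this package uses, and the function name and the 14-nucleotide cutoff are assumptions for the example.

```python
def melting_temperature_estimate(dna: str) -> float:
    """Estimate Tm in degrees Celsius from base counts alone.

    Uses the Wallace rule below 14 nt and the basic GC-based formula
    otherwise. Salt and primer concentration are not modeled.
    """
    dna = dna.upper()
    n = len(dna)
    if n == 0:
        raise ValueError("Sequence is empty")
    at = dna.count("A") + dna.count("T")
    gc = dna.count("G") + dna.count("C")
    if n < 14:
        return float(2 * at + 4 * gc)
    return 64.9 + 41.0 * (gc - 16.4) / n


print(melting_temperature_estimate("ATGC"))
print(melting_temperature_estimate("ATGCATGCATGCATGC"))
```

More accurate predictions generally rely on nearest-neighbor thermodynamic models, which this sketch deliberately leaves out.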
File I/O Support
- FASTA/FASTQ format support
- GZIP/BZIP2 compression
- Batch processing capabilities
- Format conversion utilities
Visualization
- GC content plots
- Sequence logos
- Multiple sequence alignments
- Interactive visualizations
Command Line Interface
- Intuitive command structure
- Batch processing support
- Multiple output formats (text, JSON, CSV)
- Visualization export to image files
🏗️ Project Structure
dna_sequence_analysis_tool/
├── core/
│ ├── __init__.py
│ ├── sequence_analysis.py
│ ├── sequence_validation.py
│ └── visualization.py
├── data/
│ └── sample_sequences.fasta
├── utils/
│ ├── file_io.py
│ └── logging.py
├── tests/
│ ├── test_sequence_analysis.py
│ └── test_validation.py
├── cli.py
├── README.md
└── requirements.txt
📋 Requirements
- Python 3.8+
Core Dependencies
- NumPy >= 1.19.0
- SciPy >= 1.5.0
- Biopython >= 1.78
- pandas >= 1.2.0
- matplotlib >= 3.3.0 (for visualization)
- click >= 8.0.0 (for CLI)
- rich >= 10.0.0 (for rich CLI output)
- plotly >= 5.0.0 (for interactive visualizations)
Optional Dependencies
- python-magic (for file type detection)
- python-magic-bin (Windows only, for file type detection)
📦 Installation
# Install from PyPI
pip install dna-sequence-analysis-tool
# Install from source
git clone https://github.com/YanCotta/DNASequenceAnalysisTool.git
cd DNASequenceAnalysisTool
pip install -e .
🔍 Quick Start
from dna_sequence_analysis_tool import DNAToolkit
# Initialize toolkit
toolkit = DNAToolkit()
# Analyze a sequence
sequence = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
result = toolkit.analyze_sequence(sequence)
print(f"GC Content: {result.gc_content}%")
📊 API Documentation
Core Classes
DNASequence
class DNASequence:
    """
    Core class for DNA sequence analysis.

    Attributes:
        sequence (str): The DNA sequence
        length (int): Sequence length
        gc_content (float): GC content percentage
    """
Basic Functions
validate_sequence(sequence)
- Validates DNA sequences (A, T, G, C)
- Returns: (bool, str) - validity status and error message
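The contract described above (a validity flag plus an error message) could look roughly like this. It is a sketch, not the package's implementation: the name `validate_sequence_sketch`, the case-insensitive matching, and the exact message wording are all assumptions.

```python
from typing import Tuple

VALID_BASES = frozenset("ATGC")


def validate_sequence_sketch(sequence: str) -> Tuple[bool, str]:
    """Return (is_valid, error_message); the message is empty when valid."""
    if not isinstance(sequence, str) or not sequence:
        return False, "Sequence must be a non-empty string"
    invalid = sorted(set(sequence.upper()) - VALID_BASES)
    if invalid:
        return False, f"Invalid characters: {', '.join(invalid)}"
    return True, ""


print(validate_sequence_sketch("ATGC"))
print(validate_sequence_sketch("ATXG"))
```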
calculate_gc_content(dna_sequence)
- Calculates GC content percentage
- Raises ValueError for invalid sequences
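A minimal sketch of this behavior, assuming that "invalid" means empty or containing characters other than A, T, G, C, might be written as follows. The function name is hypothetical and chosen to avoid implying it is the package's own code.

```python
def gc_content_sketch(dna_sequence: str) -> float:
    """Return GC content as a percentage in the range 0 to 100."""
    seq = dna_sequence.upper()
    if not seq or set(seq) - set("ATGC"):
        raise ValueError("Sequence must be non-empty and contain only A, T, G, C")
    gc_count = seq.count("G") + seq.count("C")
    return 100.0 * gc_count / len(seq)


print(gc_content_sketch("ATGC"))
```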
reverse_complement(dna_sequence)
- Generates reverse complement of DNA sequence
- Returns: the reverse-complement sequence as a string
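The operation itself is short: complement each base, then reverse the result. The sketch below uses a translation table and is illustrative only; the function name is hypothetical, and preserving lowercase input is an assumption.

```python
# Maps each base to its complement; both cases are handled.
_COMPLEMENT = str.maketrans("ATGCatgc", "TACGtacg")


def reverse_complement_sketch(dna_sequence: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return dna_sequence.translate(_COMPLEMENT)[::-1]


print(reverse_complement_sketch("AAGT"))
```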
find_motif(dna_sequence, motif)
- Finds all occurrences of a
