Radseq

Collection of Python scripts for parsing/analyses of RAD-seq data


RAD-seq script library

Collection of Python scripts for parsing/analysis of reduced representation sequencing data (e.g. RAD-seq, nextRAD). While many of the scripts are functional, some still need considerable cleanup and more thorough testing, so this repository is very much a work in progress.

These scripts all require Python 3, and some require additional packages: BioPython and NumPy (both easily installed with the Miniconda or Anaconda installers) or PyVCF (which can be installed with e.g. pip install PyVCF). Usage information for each script can be obtained with the -h or --help flag (e.g. python3 name_of_script.py -h) and is also listed in this README.

This documentation is dynamically generated using the listed README_compile.py script, extracting purpose, usage and links to example files from the argparse information of each script.

Recently added

vcf_remap2genome.py - script to remap VCF from de novo RAD assembly back to a reference genome

pyrad_find_caps_markers.py - search PyRAD output file for diagnostic CAPS loci that can distinguish two groups (or one group and all other samples)

vcf_clone_detect.py - script to facilitate identification of clones in a dataset

vcf

vcf_remap.py - Remaps variants in VCF format to new CHROM and POS as obtained through the mapping_get_bwa_matches.py script. Positions are rough estimates because: (1) the new position is simply the mapping position plus the 0-based position in the locus (and so does not take into account e.g. reference insertions), and (2) one standard contig length is used to determine the position in reverse-mapped reads (flag 16). [File did not pass PEP8 check]

usage: vcf_remap.py [-h] vcf_file mapping_file locus_length

positional arguments:
  vcf_file      vcf input file
  mapping_file  file with mapping results
  locus_length  length of query loci

optional arguments:
  -h, --help    show this help message and exit

Example input file(s): vcf_file.vcf.
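
The position estimate described for vcf_remap.py can be illustrated with a minimal sketch. The function name, argument names and the exact reverse-strand formula below are assumptions made for illustration and are not taken from the script itself; only the general idea (mapping position plus offset within the locus, with one fixed locus length for reverse-mapped reads) comes from the description above.

```python
REVERSE_FLAG = 16  # SAM flag bit for a read mapped to the reverse strand


def remap_position(map_pos, locus_pos, locus_length, flag=0):
    """Return a rough reference POS for a SNP (hypothetical helper).

    map_pos:      1-based leftmost mapping position of the locus
    locus_pos:    1-based position of the SNP within the locus (VCF POS)
    locus_length: single standard length assumed for every locus
    flag:         SAM flag of the mapped locus
    """
    if flag & REVERSE_FLAG:
        # Assumption: for reverse-mapped loci, count the offset from the
        # other end of the locus using the standard locus length.
        offset = locus_length - locus_pos
    else:
        # Forward-mapped loci: 0-based position within the locus.
        offset = locus_pos - 1
    return map_pos + offset


forward = remap_position(1000, 10, 100)           # 1000 + 9
reverse = remap_position(1000, 10, 100, flag=16)  # 1000 + 90
```

Because insertions, deletions and clipping are ignored, the result is only an approximation, which matches the caveats listed for the script.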

vcf_missing_data.py - Outputs list of missing data (# and % of SNPs) for each sample in VCF, to identify poor-performing samples to eliminate prior to SNP filtering. Takes vcf_filename as argument. Outputs to STDOUT (no output file). [File did not pass PEP8 check]

usage: vcf_missing_data.py [-h] vcf_file

positional arguments:
  vcf_file    input file with SNP data (`.vcf`)

optional arguments:
  -h, --help  show this help message and exit

Example input file(s): vcf_file.vcf.
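
As a rough illustration of the per-sample tally this script reports, the sketch below counts missing genotype calls in plain VCF text without any third-party packages. The function name is hypothetical, and treating any genotype that contains `.` (including partially missing calls such as `./1`) as missing is an assumption, not necessarily what the script does.

```python
def missing_per_sample(vcf_lines):
    """Return {sample: (n_missing, pct_missing)} for tab-delimited VCF lines."""
    samples, missing, total = [], {}, 0
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            missing = {sample: 0 for sample in samples}
            continue
        total += 1
        for sample, call in zip(samples, fields[9:]):
            genotype = call.split(":")[0]  # GT is the first FORMAT field
            if "." in genotype:
                missing[sample] += 1
    return {s: (n, 100.0 * n / total if total else 0.0)
            for s, n in missing.items()}


example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tA\tB",
    "loc1\t5\t.\tA\tG\t.\tPASS\t.\tGT\t0/1\t./.",
    "loc2\t7\t.\tC\tT\t.\tPASS\t.\tGT\t1/1\t0/0",
]
summary = missing_per_sample(example_vcf)
```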

vcf_rename_loci.py - Renames CHROMS in .vcf file according to list with old/new names, and only outputs those loci that are listed. [File did not pass PEP8 check]

usage: vcf_rename_loci.py [-h] vcf_file locusnames_file

positional arguments:
  vcf_file         input file with SNP data (`.vcf`)
  locusnames_file  text file (tsv or csv) with old and new name for each locus
                   (/CHROM)

optional arguments:
  -h, --help       show this help message and exit

Example input file(s): vcf_file.vcf, locusnames_file.txt.
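
The renaming behaviour described above (rename listed loci, drop the rest) can be sketched as a small generator. The function name and the in-memory `name_map` dictionary are illustrative assumptions; the actual script reads the old/new names from the locusnames file.

```python
def rename_loci(vcf_lines, name_map):
    """Yield header lines unchanged and only records whose CHROM is listed,
    with CHROM replaced by its new name (hypothetical helper)."""
    for line in vcf_lines:
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith("#"):
            yield line  # keep all header lines as they are
            continue
        chrom, rest = line.split("\t", 1)
        if chrom in name_map:
            yield name_map[chrom] + "\t" + rest


example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT",
    "loc1\t5\t.\tA\tG",
    "loc2\t7\t.\tC\tT",
]
renamed = list(rename_loci(example_vcf, {"loc1": "chr1"}))
```

Here `loc2` is dropped from the output because it does not appear in the mapping.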

vcf_find_clones.py - Script compares the allelic similarity of individuals in a VCF, and outputs all pairwise comparisons. This can be used to detect potential clones based on percentage match. Note: highest matches can be assessed in the output file by using $ sort -rn --key=5 output_file.txt | head -n 50 in the terminal. [File did not pass PEP8 check]

usage: vcf_find_clones.py [-h] vcf_file

positional arguments:
  vcf_file    input file with SNP data (`.vcf`)

optional arguments:
  -h, --help  show this help message and exit

Example input file(s): vcf_file.vcf.
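
To make the idea of pairwise allelic similarity concrete, here is a minimal sketch. The similarity measure used (percentage of matching alleles over sites where both individuals are genotyped, comparing sorted alleles) is an assumption chosen for illustration; the script's own calculation may differ in detail. Function names are hypothetical.

```python
from itertools import combinations


def allelic_similarity(gts_a, gts_b):
    """Percent matching alleles over sites genotyped in both individuals."""
    matches = compared = 0
    for a, b in zip(gts_a, gts_b):
        if "." in a or "." in b:
            continue  # skip sites with missing data in either individual
        alleles_a = sorted(a.replace("|", "/").split("/"))
        alleles_b = sorted(b.replace("|", "/").split("/"))
        matches += sum(x == y for x, y in zip(alleles_a, alleles_b))
        compared += len(alleles_a)
    return 100.0 * matches / compared if compared else 0.0


def pairwise_similarities(genotypes):
    """Return (indiv_1, indiv_2, similarity) for every pair of individuals."""
    return [(a, b, allelic_similarity(genotypes[a], genotypes[b]))
            for a, b in combinations(sorted(genotypes), 2)]


example_genotypes = {
    "A": ["0/1", "1/1", "0/0"],
    "B": ["0/1", "1/1", "./."],
    "C": ["0/0", "1/1", "0/0"],
}
comparisons = pairwise_similarities(example_genotypes)
```

In this toy dataset, A and B match at every site where both are genotyped, so they would be the first pair to inspect as potential clones.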

vcf_get_chrom_pos_from_number.py - Translates sequential marker numbers back to CHROM/POS from the original .vcf file. Several programs only allow integers to identify markers; this script restores the original CHROM/POS for the markers that were identified. [File did not pass PEP8 check]

usage: vcf_get_chrom_pos_from_number.py [-h] vcf_file markernumbers_file

positional arguments:
  vcf_file            input file with SNP data (`.vcf`)
  markernumbers_file  text file with SNP numbers that were identified

optional arguments:
  -h, --help          show this help message and exit

Example input file(s): vcf_file.vcf, markernumbers_file.txt.
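
The lookup can be sketched as follows. It assumes markers are numbered from 1 in the order their records appear in the .vcf file, which is an assumption rather than something stated above; the function name is hypothetical.

```python
def chrom_pos_by_number(vcf_lines, marker_numbers):
    """Return (number, CHROM, POS) for each requested marker number.

    Assumption: marker numbers are 1-based and follow VCF record order.
    """
    records = [line.split("\t")[:2] for line in vcf_lines
               if line.strip() and not line.startswith("#")]
    return [(n, records[n - 1][0], int(records[n - 1][1]))
            for n in marker_numbers]


example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT",
    "loc1\t5\t.\tA\tG",
    "loc2\t7\t.\tC\tT",
    "loc2\t31\t.\tG\tA",
]
identified = chrom_pos_by_number(example_vcf, [3, 1])
```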

vcf_spider.py - Wrapper for PGDSpider on Mac OS to convert .vcf files to various formats. Note: set the PGDSPIDER_PATH constant before use, and make the script executable in the terminal with $ chmod +x vcf_spider.py.

usage: vcf_spider.py [-h] vcf_filename pop_filename output_filename

positional arguments:
  vcf_filename     original vcf file
  pop_filename     pop filename (.txt)
  output_filename  output filename (extension used to determine file format:
                   .genepop, .bayescan, .structure or .arlequin)

optional arguments:
  -h, --help       show this help message and exit

vcf_clone_detect.py - Attempts to identify groups of clones in a dataset. The script (1) conducts pairwise comparisons (allelic similarity) for all individuals in a .vcf file, (2) produces a histogram of genetic similarities, (3) lists the highest matches to assess for a potential clonal threshold, (4) clusters the groups of clones based on a particular threshold (supplied or roughly inferred), and (5) lists the clonal individuals that can be removed from the dataset (so that the one individual with the least missing data remains). If the optional popfile is given, clonal groups are sorted by population. Note: first, the script is run with a .vcf file and an optional popfile to produce an output file (e.g. python3 vcf_clone_detect.py --vcf vcf_file.vcf --pop pop_file.txt --output compare_file.csv). Second, it can be rerun using the precalculated similarities under different thresholds (e.g. python3 vcf_clone_detect.py --input compare_file.csv --threshold 94.5). [File did not pass PEP8 check]

usage: vcf_clone_detect.py [-h] [-v vcf_file] [-p pop_file] [-i compare_file]
                       [-o compare_file] [-t threshold]

optional arguments:
  -h, --help            show this help message and exit
  -v vcf_file, --vcf vcf_file
                        input file with SNP data (`.vcf`)
  -p pop_file, --pop pop_file
                        text file (tsv or csv) with individuals and
                        populations (to accompany `.vcf` file)
  -i compare_file, --input compare_file
                        input file (csv) with previously calculated pairwise
                        comparisons (using the `--output` option)
  -o compare_file, --output compare_file
                        output file (csv) for all pairwise comparisons (can
                        later be used as input with `--input`)
  -t threshold, --threshold threshold
                        manual similarity threshold (e.g. `94.5` means at
                        least 94.5 percent allelic similarity for individuals
                        to be considered clones)
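
Steps (4) and (5) above can be sketched as threshold-based grouping followed by choosing which individuals to drop. The single-linkage grouping via union-find used here is an assumed technique for illustration, not necessarily the clustering the script performs, and both function names are hypothetical.

```python
def cluster_clones(pairs, threshold):
    """Group individuals linked by similarity >= threshold (single linkage).

    pairs: iterable of (indiv_1, indiv_2, percent_similarity)
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b, similarity in pairs:
        root_a, root_b = find(a), find(b)
        if similarity >= threshold:
            parent[root_a] = root_b  # merge the two groups
    groups = {}
    for indiv in list(parent):
        groups.setdefault(find(indiv), set()).add(indiv)
    return [group for group in groups.values() if len(group) > 1]


def individuals_to_remove(group, missing_counts):
    """Keep the individual with the least missing data; return the others."""
    keep = min(sorted(group), key=lambda indiv: missing_counts[indiv])
    return sorted(group - {keep})


example_pairs = [("A", "B", 99.0), ("B", "C", 95.0), ("A", "C", 80.0),
                 ("A", "D", 60.0), ("B", "D", 61.0), ("C", "D", 59.0)]
clone_groups = cluster_clones(example_pairs, threshold=94.5)
removable = individuals_to_remove(clone_groups[0], {"A": 5, "B": 2, "C": 9})
```

Note that single linkage places A and C in the same group through B even though their direct similarity is below the threshold; a stricter grouping rule would behave differently.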

vcf_minrep_filter_abs.py - Filters .vcf file for SNPs that are genotyped for a minimum number of individuals in each of the populations (rather than overall proportion of individuals). This can help to guarantee a minimum number of individuals to calculate population-based statistics, and eliminate loci that might be suffering from locus drop-out in particular populations. Note: only individuals that are listed in the popfile are taken into account to determine the number of individuals genotyped (but all individuals are output). [File did not pass PEP8 check]

usage: vcf_minrep_filter_abs.py [-h]
                            vcf_file pop_file min_proportion
                            output_filename

positional arguments:
  vcf_file         input file with SNP data (`.vcf`)
  pop_file         text file (tsv or csv) with individuals and populations
  min_proportion   proportion of individuals required to be genotyped in each
                   population for a SNP to be included (e.g `0.8` for 80
                   percent of individuals)
  output_filename  name of output file (`.vcf`)

optional arguments:
  -h, --help       show this help message and exit

Example input file(s): vcf_file.vcf, pop_file.txt.
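
A per-population representation check for a single SNP might look like the sketch below. Because the positional argument listed above is `min_proportion`, the sketch applies a proportion threshold; that choice, the function name and the in-memory dictionaries are assumptions for illustration only.

```python
def passes_min_rep(genotypes, populations, min_proportion):
    """Return True if every population meets the genotyping threshold.

    genotypes:   {sample: GT string} for one SNP
    populations: {sample: population}, as read from a popfile; samples
                 that are not listed here are ignored
    """
    totals, genotyped = {}, {}
    for sample, pop in populations.items():
        totals[pop] = totals.get(pop, 0) + 1
        # Samples absent from the genotype dict are treated as missing.
        if "." not in genotypes.get(sample, "./."):
            genotyped[pop] = genotyped.get(pop, 0) + 1
    return all(genotyped.get(pop, 0) / n >= min_proportion
               for pop, n in totals.items())


example_pops = {"A": "pop1", "B": "pop1", "C": "pop2", "D": "pop2"}
example_snp = {"A": "0/1", "B": "./.", "C": "0/0", "D": "1/1", "E": "./."}
kept_at_half = passes_min_rep(example_snp, example_pops, 0.5)
kept_at_80 = passes_min_rep(example_snp, example_pops, 0.8)
```

Sample E is not in the population mapping, so its missing call does not affect the outcome, in line with the note above that only individuals listed in the popfile are counted.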

vcf_minrep_filter.py - Filters .vcf file for SNPs that are genotyped for a minimum proportion of individuals in each of the populations (rather than overall proportion of individuals). This can help to guarantee a minimum number of individuals to calculate population-based statistics, and eliminate loci that might be suffering from locus drop-out in particular populations. Note: only individuals that are listed in the popfile are taken into account to determine the proportion of individuals genotyped (but all individuals are output). [File did not pass PEP8 check]

usage: vcf_minrep_fi
