=============================
SangerMutantLibraryAnalysis
=============================

Summary
-------

This is a Python_ script for analyzing and plotting the distribution of mutations among Sanger-sequenced clones in a mutant library of a protein-coding gene.

The script for this analysis is available on GitHub_.

This script was written by Jesse Bloom_.

Analyses
--------

Here are the analyses (each in its own subdirectory) performed with this script:

* 2013Analysis_Influenza_NP_Aichi68 contains the results of analyzing a mutant library of the A/Aichi/2/1968 (H3N2) influenza NP gene. Analysis performed by Jesse Bloom_ using SangerMutantLibraryAnalysis v0.21_.

* 2014Analysis_Influenza_HA_WSN contains the results of analyzing a mutant library of the A/WSN/1933 (H1N1) influenza HA gene. Analysis performed by Jesse Bloom_ using SangerMutantLibraryAnalysis v0.2_.

* 2015Analysis_Influenza_HA_WSN_lowermutationrate contains the results of analyzing additional mutant libraries of the A/WSN/1933 (H1N1) influenza HA gene after only a single round of mutagenesis, done to attain a lower overall mutation rate. These analyses are also broken up by replicate library and by cloning vector. The commands used to run the analyses, which use the new command-line argument options, are in the file Example_commands.

Requirements
------------

This analysis simply consists of a Python_ script. It has been tested with Python_ versions 2.6 and 2.7, and probably works with other 2.* versions as well.

The script requires scipy_ and matplotlib_. It has been tested with scipy_ version 0.12.0 and matplotlib_ version 1.2.1, but will probably work with other versions as well.

The script uses ImageMagick convert_ to convert ``*.pdf`` files to ``*.jpg`` files.

Running the script
------------------

The analysis is performed by the script analyze_library.py. To run the script, simply go to the current directory and type the command::

    python analyze_library.py

The script will then ask you to enter the names of two input files: the sequence file, and the mutation list file. These are both text files that should have the following format:

* **The sequence file:** this is a FASTA file that contains a single protein-coding gene. This should be the gene that you are sequencing. Here is an example of such a file (``WSN-HA.fasta``)::

    >WSN-HA
    ATGAAGGCAAAACTACTGGTCCTGTTATATGCATTTGTAGCTACAGATGCAGACACAATATGTATAGGCTACCATGCGAACAACTCAACCGACACTGTTGACACAATACTCGAGAAGAATGTGGCAGTGACACATTCTGTTAACCTGCTCGAAGACAGCCACAACGGGAAACTATGTAAATTAAAAGGAATAGCCCCACTACAATTGGGGAAATGTAACATCACCGGATGGCTCTTGGGAAATCCAGAATGCGACTCACTGCTTCCAGCGAGATCATGGTCCTACATTGTAGAAACACCAAACTCTGAGAATGGAGCATGTTATCCAGGAGATCTCATCGACTATGAGGAACTGAGGGAGCAATTGAGCTCAGTATCATCATTAGAAAGATTCGAAATATTTCCCAAGGAAAGTTCATGGCCCAACCACACATTCAACGGAGTAACAGTATCATGCTCCCATAGGGGAAAAAGCAGTTTTTACAGAAATTTGCTATGGCTGACGAAGAAGGGGGATTCATACCCAAAGCTGACCAATTCCTATGTGAACAATAAAGGGAAAGAAGTCCTTGTACTATGGGGTGTTCATCACCCGTCTAGCAGTGATGAGCAACAGAGTCTCTATAGTAATGGAAATGCTTATGTCTCTGTAGCGTCTTCAAATTATAACAGGAGATTCACCCCGGAAATAGCTGCAAGGCCCAAAGTAAGAGATCAACATGGGAGGATGAACTATTACTGGACCTTGCTAGAACCCGGAGACACAATAATATTTGAGGCAACTGGTAATCTAATAGCACCATGGTATGCTTTCGCACTGAGTAGAGGGTTTGAGTCCGGCATCATCACCTCAAACGCGTCAATGCATGAGTGTAACACGAAGTGTCAAACACCCCAGGGAGCTATAAACAGCAATCTCCCTTTCCAGAATATACACCCAGTCACAATAGGAGAGTGCCCAAAATATGTCAGGAGTACCAAATTGAGGATGGTTACAGGACTAAGAAACATCCCATCCATTCAATACAGAGGTCTATTTGGAGCCATTGCTGGTTTTATTGAGGGGGGATGGACTGGAATGATAGATGGATGGTATGGTTATCATCATCAGAATGAACAGGGATCAGGCTATGCAGCGGATCAAAAAAGCACACAAAATGCCATTAACGGGATTACAAACAAGGTGAACTCTGTTATCGAGAAAATGAACACTCAATTCACAGCTGTGGGTAAAGAATTCAACAACTTAGAAAAAAGGATGGAAAATTTAAATAAAAAAGTTGATGATGGGTTTCTGGACATTTGGACATATAATGCAGAATTGTTAGTTCTACTGGAAAATGAAAGGACTTTGGATTTCCATGACTTAAATGTGAAGAATCTGTACGAGAAAGTAAAAAGCCAATTAAAGAATAATGCCAAAGAAATCGGAAATGGGTGTTTTGAGTTCTACCACAAGTGTGACAATGAATGCATGGAAAGTGTAAGAAATGGGACTTATGATTATCCAAAATATTCAGAAGAATCAAAGTTGAACAGGGAAAAGATAGATGGAGTGAAATTGGAATCAATGGGGGTGTATCAGATTCTGGCGATCTACTCAACTGTCGCCAGTTCACTGGTGCTTTTGGTCTCCCTGGGGGCAATCAGTTTCTGGATGTGTTCTAATGGGTCTTTGCAGTGCAGAATATGCATCTGA 

* **The mutation list file:** this is a text file that lists the mutations. The mutations should be numbered in sequential (1, 2, ...) numbering according to the sequence specified in the sequence file. Lines in this file that are empty or begin with the character # are ignored. All other lines should specify a clone and all identified mutations. The clone name should be the first entry on the line, followed by a colon. There is then a comma-delimited list of the mutations. The mutations are indicated as follows:

    * Single nucleotide substitutions are indicated as *G1378T* for mutation of site 1378 from *G* to *T*.

    * Multiple-nucleotide mutations at the same codon should be listed as *AG349GA* or *TCT1003GCC*. List mutations this way if they are at consecutive sites, or if two mutated sites are separated by a single unmutated site (as in *TCT1003GCC*, where the middle *C* is unchanged), as these are probably mutations of the same codon.

    * Deletions should be listed like this: *delC392* for deletion of the *C* at position 392.

    * Insertions should be listed like this: *insG392* for insertion of a *G* at position 392.

    * If a clone has no mutations, you should enter *None*.

  The script will check that the specified wildtype nucleotides actually match those indicated in the sequence file. If they do not, an error will be raised. Note that even if your sequence contains an insertion or deletion, you must ensure that subsequent sites are still numbered according to sequential numbering of the provided sequence.

  Here is an example input file::

    # Sequencing results for 3 WSN-HA mutation libraries. 
    # Results from Aug-28 2013 and Sept-3 2013.
    # Twelve single colonies analyzed for each library replicate with 3 primers. 
    # Total 36 samples.  

    1-1: AAG406GAT, GA1342AG
    1-2: GTA442TAC, AAC667GCT, AAC853GGA
    1-3: ATG1204GTC
    1-4: G373A, G510T
    1-5: AAT118TTG, GGT784CGC, CA1220AT
    1-6: GCA778TAC, TT1019CG, ATA1081GGG, A1370C
    1-7: AGG670TCC, A1232T, delA1544
    1-8: None

    1-9: AAT535GAG
    1-10: ACA1150TAC, ATT1171CCG, GTT1192CTA
    1-11: C1175A
    1-12: TC566AT, C855T, C912T
    2-1: CTA172AGG, TC1238AT, T1675G
    2-2: GCT1048CAC
    2-3: AGC472TTT, TAT733CTG, G1070A
    #2-4: mixed template, clearly colony is actually two different mutants 

    2-5: TCA256CCC, GA832CG, AGG985CGT
    2-6: T818A, G834A, GG1288TC
    2-7: TTC1237AAT
    2-8: GA307CG, T1086A, AG1199GC
    2-9: GGC67CAG, insG1121, AGG1537TAC
    2-10: GG238AC, TGT319CAA, AAG406GAC, C1136A, AA1337TG, GT1621TC
    2-11: A64G, GGG208AAT, ATG727TCA, GG986CC
    2-12: TGT871AAA, TGT1660GAG, TG1690AT
    3-1: G207T

    3-2: None
    3-3: ACA1150TAC

    3-4: AAC436TAT, T964G, TCT1189GTA
    3-5: GA272TC, TCC1015CAG

    3-6: TCA376ACT, T1074G, A1283T
    3-7: G67C
    3-8: GAT514TAG

    #3-9: bad sequencing read with forward primer
    3-10: AG625CT, AGC910TTT
    3-11: T606G
    3-12: ATT1162CTA
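
The format rules above can be sketched in Python as follows. This is a minimal illustration, not the code in ``analyze_library.py``: the names ``parse_mutation_lines`` and ``check_wildtype``, the tuple layout, and the regular expressions are assumptions made for this example. The wildtype check is applied to substitutions and deletions only, since an insertion names no wildtype nucleotide.

```python
import re

# Token formats described in the list above (assumed to use only A, C, G, T):
#   G1378T, AG349GA, TCT1003GCC   substitutions
#   delC392, insG392              deletion and insertion
SUBSTITUTION = re.compile(r'^([ACGT]+)(\d+)([ACGT]+)$')
INDEL = re.compile(r'^(del|ins)([ACGT]+)(\d+)$')


def parse_mutation_lines(lines):
    """Return a list of (clone_name, mutations) tuples.

    Each mutation is a tuple (kind, position, old, new), where kind is
    'sub', 'del', or 'ins' and position uses 1-based nucleotide numbering.
    """
    clones = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # empty lines and comment lines are ignored
        clone, _, entries = line.partition(':')
        mutations = []
        for token in entries.split(','):
            token = token.strip()
            if token == 'None':
                continue  # clone with no mutations
            sub = SUBSTITUTION.match(token)
            indel = INDEL.match(token)
            if sub and len(sub.group(1)) == len(sub.group(3)):
                mutations.append(
                    ('sub', int(sub.group(2)), sub.group(1), sub.group(3)))
            elif indel and indel.group(1) == 'del':
                mutations.append(('del', int(indel.group(3)), indel.group(2), ''))
            elif indel:
                mutations.append(('ins', int(indel.group(3)), '', indel.group(2)))
            else:
                raise ValueError(
                    'Cannot parse mutation %r for clone %s' % (token, clone))
        clones.append((clone.strip(), mutations))
    return clones


def check_wildtype(seq, mutations):
    """Raise ValueError if stated wildtype nucleotides disagree with seq."""
    for kind, pos, old, new in mutations:
        # Insertions have an empty 'old' string and are skipped.
        if old and seq[pos - 1:pos - 1 + len(old)] != old:
            raise ValueError(
                'Wildtype mismatch at site %d: expected %s' % (pos, old))


example = [
    '# comment line',
    '',
    '1-4: G373A, G510T',
    '1-7: AGG670TCC, A1232T, delA1544',
    '1-8: None',
]
for clone, muts in parse_mutation_lines(example):
    print(clone, muts)
```

A clone entered as *None* comes back with an empty mutation list, so it still counts toward the number of clones read.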

The script will also ask you to enter the codon number of the first codon in the mutagenized region of the gene. This allows specification of an arbitrary start codon for mutagenesis. Codon numbering should follow a (1, 2, ...) numbering scheme, where the first codon in the gene is indexed as codon 1, the second as codon 2, etc. For example, if the entire gene was mutagenized, you should enter 1. However, if the mutagenized region does not start until codon 25, you should enter 25.
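
The relationship between the 1-based nucleotide numbering of the mutation list and the 1-based codon numbering described here can be sketched as follows. The function names ``codon_of_site`` and ``is_in_mutagenized_region`` are hypothetical and chosen for this example; how ``analyze_library.py`` itself uses the start codon is not shown in this document.

```python
def codon_of_site(nt_position):
    """Return the 1-based codon number containing a 1-based nucleotide site."""
    if nt_position < 1:
        raise ValueError('Nucleotide positions are numbered from 1')
    return (nt_position - 1) // 3 + 1


def is_in_mutagenized_region(nt_position, mutstart):
    """Return True if the site lies at or after codon mutstart.

    Assumption for this sketch: the mutagenized region begins at codon
    mutstart; no upper bound is applied.
    """
    return codon_of_site(nt_position) >= mutstart


# Sites taken from the example mutation list: A64G, G67C, AAT118TTG.
for site in (64, 67, 118):
    print(site, codon_of_site(site), is_in_mutagenized_region(site, 25))
```

With this mapping, nucleotides 1 to 3 fall in codon 1 and nucleotides 73 to 75 fall in codon 25.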

If you don't want to type in this information manually when you run the script, you can provide it on the command line when calling the script. See Example_commands for examples of how to call the script with all the necessary information provided on the command line. The available options are::

    usage: analyze_library.py [-h] [--outputprefix OUTPUTPREFIX]
                              [--seqfile SEQFILE] [--mfile MFILE]
                              [--mutstart MUTSTART] [--title TITLE]

    optional arguments:
      -h, --help            show this help message and exit
      --outputprefix OUTPUTPREFIX
                            optional prefix for output files generated
      --seqfile SEQFILE     name of FASTA file containing the gene sequence
      --mfile MFILE         name of the file containing the list of mutations
      --mutstart MUTSTART   position of the first codon in the mutated segment of
                            the gene
      --title TITLE         title for plots generated
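
The usage message above resembles the help text generated by Python's standard ``argparse`` module. As a rough sketch of how such an interface can be declared, the following builds a parser with the same option names and help strings. The function name ``build_parser``, the ``type=int`` conversion for ``--mutstart``, and the default of ``None`` for omitted options are assumptions of this example; the actual script may handle them differently.

```python
import argparse


def build_parser():
    """Build a parser mirroring the options listed in the usage message."""
    parser = argparse.ArgumentParser(prog='analyze_library.py')
    parser.add_argument('--outputprefix',
                        help='optional prefix for output files generated')
    parser.add_argument('--seqfile',
                        help='name of FASTA file containing the gene sequence')
    parser.add_argument('--mfile',
                        help='name of the file containing the list of mutations')
    # Assumption: the codon number is converted to an integer on parsing.
    parser.add_argument('--mutstart', type=int,
                        help='position of the first codon in the mutated '
                             'segment of the gene')
    parser.add_argument('--title', help='title for plots generated')
    return parser


# Parse an explicit argument list instead of sys.argv for illustration.
args = build_parser().parse_args([
    '--seqfile', 'WSN-HA.fasta',
    '--mfile', 'wsn_mutations_090413.txt',
    '--mutstart', '1',
])
print(args.seqfile, args.mfile, args.mutstart)
```

In this sketch an option left off the command line is ``None``, which is one way a script could decide to fall back to prompting for the value interactively.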

Output of the script
--------------------

The script will print some information about the mutation statistics to standard output. It will also create some PDF plot files. For example, running the script using the example sequence file WSN-HA.fasta and the example mutation list wsn_mutations_090413.txt provided with this script on GitHub_ will produce the following information printed to standard output::

    Beginning analysis.

    Enter the name of the FASTA file containing the gene sequence: WSN-HA.fasta
    Read a coding sequence of length 1698

    Enter the name of the file containing the list of mutations: wsn_mutations_090413.txt

    Enter the position of the first codon in the mutated segment of the gene: 1

    Reading mutations from wsn_mutations_090413.txt
    Read entries for 34 clones

    Substitutions begin at following positions: 22, 23, 23, 40, 58, 69, 70, 80, 86, 91, 103, 107, 125, 126, 136, 136, 146, 148, 158, 170, 172, 179, 189, 202, 209, 223, 224, 243, 245, 260, 262, 273, 278, 278, 285, 285, 291, 304, 304, 322, 329, 329, 339, 340, 350, 357, 358, 361, 362, 379, 384, 384, 388, 391, 392, 397, 398, 400, 402, 407, 411, 413, 413, 428, 430, 446, 448, 457, 513, 541, 554, 559, 564

    Indels begin at following positions: 374, 515

    Found
